Title | Combining results other than coefficients in e(b) with multiply imputed data | |

Authors | Isabel Canette and Yulia Marchenko, StataCorp |

mi
estimate estimates parameters from multiply imputed data and adjusts
these estimates and their respective standard errors for the imputation
uncertainty using Rubin’s combination rules. **mi estimate** is designed
to work with Stata estimation commands. As such, it combines the estimates of
coefficients, which are stored in matrix **e(b)**, and their respective
variance–covariance estimates (VCE), stored in matrix **e(V)**. You may
also want to combine results other than **e(b)**. The aim of this
tutorial is to show you how to do that.

This entry is organized as follows:

- Finding out where your results are saved
- Combining point estimates only
- Combining point estimates and their variances
- An example using suest

First, you need to know where your results are saved. Most Stata commands
save results either in **e()** or **r()**. You can type

. ereturn list

or

. return list

to find out where your value of interest is saved after running a command.
Alternatively, you can look at the *Stored results* section in the help file or
documentation for the command.

In some situations, we are interested in a point estimate, such as the R-squared
(R^{2}) value from a regression, not in its variance. According to
Rubin's rules, the estimate of the value of interest should be computed for
each imputation, and the overall value will be the mean of these estimates.
We can do this manually, taking advantage of
mi xeq,
which allows you to run sequences of commands of interest on each
individual imputation.

In the following example, results are combined for the
R^{2} value from a linear regression. We use this example
strictly for the purpose of illustration. See the unofficial command
**mibeta** (type **. search mibeta** to locate and install this
command), which automatically provides R^{2}, adjusted R^{2}, and
standardized coefficients after regression for imputed data.

We will use the dataset from the example of house resale prices in
[MI]
**mi estimate**. Our first step is to identify where our result of interest
is stored. The **regress** command is an e-class command and it
stores R^{2} in **e(r2)**:

.webuse mhouses1993s30(Albuquerque home prices Feb15–Apr30, 1993) .mi xeq 0: regress price tax sqft age nfeatures ne custom cornerm=0 data: -> regress price tax sqft age nfeatures ne custom corner

Source | SS df MS | Number of obs = 66 | |

F(7, 58) = 51.86 | |||

Model | 9164658.61 7 1309236.94 | Prob > F = 0.0000 | |

Residual | 1464105.15 58 25243.1922 | R-squared = 0.8623 | |

Adj R-squared = 0.8456 | |||

Total | 10628763.8 65 163519.442 | Root MSE = 158.88 |

price | Coefficient Std. err. t P>|t| [95% conf. interval] | |

tax | .4988701 .1584853 3.15 0.003 .1816273 .8161128 | |

sqft | .3522184 .0957476 3.68 0.001 .1605588 .5438779 | |

age | -.5650817 2.002529 -0.28 0.779 -4.57358 3.443416 | |

nfeatures | 4.389607 18.55499 0.24 0.814 -32.75223 41.53145 | |

ne | -17.38534 47.27462 -0.37 0.714 -112.0158 77.2451 | |

custom | 174.9411 53.72371 3.26 0.002 67.40139 282.4808 | |

corner | -73.58234 49.13007 -1.50 0.140 -171.9269 24.76218 | |

_cons | 92.7448 101.607 0.91 0.365 -110.6438 296.1334 | |

We used **mi xeq 0:** in the command line above to execute the **regress** command
on the original data ** m=0**.

Now we want to compute the average of the R^{2} values. To do so, we
can use **mi xeq:** to run regressions on each imputed dataset and
collect individual R^{2} values:

.mi querydata mi set mlong, M = 30 .local M = r(M).scalar r2 = 0.mi xeq 1/`M': regress price tax sqft age nfeatures ne custom corner; > scalar r2 = r2 + e(r2)m=1 data: -> regress price tax sqft age nfeatures ne custom corner

Source | SS df MS | Number of obs = 117 | |

F(7, 109) = 74.79 | |||

Model | 13895850.5 7 1985121.5 | Prob > F = 0.0000 | |

Residual | 2893096.28 109 26542.1677 | R-squared = 0.8277 | |

Adj R-squared = 0.8166 | |||

Total | 16788946.8 116 144732.3 | Root MSE = 162.92 |

price | Coefficient Std. err. t P>|t| [95% conf. interval] | |

tax | .676841 .1140902 5.93 0.000 .450718 .9029641 | |

sqft | .2091065 .065971 3.17 0.002 .0783542 .3398589 | |

age | -.2350972 1.472105 -0.16 0.873 -3.152762 2.682568 | |

nfeatures | 4.17071 12.572 0.33 0.741 -20.74658 29.088 | |

ne | 14.4082 33.3729 0.43 0.667 -51.73581 80.55221 | |

custom | 137.1073 41.98151 3.27 0.001 53.90127 220.3132 | |

corner | -75.24024 39.1843 -1.92 0.057 -152.9023 2.421776 | |

_cons | 149.6721 66.01572 2.27 0.025 18.83105 280.5131 | |

Source | SS df MS | Number of obs = 117 | |

F(7, 109) = 71.60 | |||

Model | 13790079.1 7 1970011.3 | Prob > F = 0.0000 | |

Residual | 2998867.71 109 27512.5478 | R-squared = 0.8214 | |

Adj R-squared = 0.8099 | |||

Total | 16788946.8 116 144732.3 | Root MSE = 165.87 |

price | Coefficient Std. err. t P>|t| [95% conf. interval] | |

tax | .6938792 .1261562 5.50 0.000 .4438417 .9439168 | |

sqft | .2044596 .0710267 2.88 0.005 .0636869 .3452322 | |

age | .3372971 1.591275 0.21 0.833 -2.816558 3.491152 | |

nfeatures | 6.940631 13.04224 0.53 0.596 -18.90865 32.78992 | |

ne | 7.280361 36.79407 0.20 0.844 -65.6443 80.20502 | |

custom | 132.975 43.24393 3.07 0.003 47.26696 218.6831 | |

corner | -70.26249 39.74756 -1.77 0.080 -149.0409 8.515876 | |

_cons | 128.2627 67.20399 1.91 0.059 -4.933393 261.4588 | |

We used **mi query** above to identify the number of imputations and to store it
in the local macro **M**. Then we used **mi xeq 1/`M':** to run
regressions on the imputed data and accumulate the R^{2} values in the
**r2** scalar. We used a semicolon to separate the two commands
we wished to execute on each imputed datum; see [MI] **mi xeq** for
details. Finally, to obtain the average, we divided **r2** by the number
of imputations.

We recommend applying Rubin’s combination rules to
parameters in a metric for which the asymptotic normal approximation works
well. Fisher’s *z*, or inverse hyperbolic tangent [**atanh()**],
transformation is often recommended for the correlation coefficient R to
improve its asymptotic normality. Thus we can combine the R^{2}
values in the transformed metric and then use the inverse transformation,
**tanh()**, to switch back to the original metric.

.local M = 30.scalar r2 = 0.qui mi xeq 1/`M': regress price tax sqft age nfeatures ne custom corner; scalar > r2 = r2 + atanh(sqrt(e(r2))).scalar r2 = tanh(r2/`M')^2.di as txt "R2 using Fisher's z over imputed data = " as res r2R2 using Fisher's z over imputed data = .82306058

The two estimates of the R^{2} are very similar.

If we want to combine point estimates and their variances, we can
use **mi estimate**. However, we need to create an e-class program
(see program)
that saves the necessary results where **mi estimate** expects to see
them. The general guidelines are described in
program properties.

Below we demonstrate how to use **mi estimate** to combine results from
r-class command
roctab using
the heart attack example from [MI] **mi estimate**. The **roctab**
command posts the estimate of the area under the curve (AUC) in
**r(area)** and its standard error in **r(se)**:

.webuse mheart1s20(Fictional heart attack data; BMI missing) .qui mi passive: generate high_bmi = (bmi>30) if bmi<..mi xeq 0: roctab attack high_bmim=0 data: -> roctab attack high_bmi

ROC Asymptotic normal | |||

Obs area Std. err. [95% conf. interval] | |||

132 0.5452 0.0322 0.48208 0.60833 |

In the above series of commands, we created a new variable indicating high
body mass index (BMI) values,
**high_bmi**, which we use as a classification variable in the
receiver operating characteristic (ROC)
analysis of heart attacks. Because **bmi** is the imputed variable,
**high_bmi** (being a function of **bmi**) is a passive variable. As
such, we used **mi passive: generate** to create the passive variable
**high_bmi**; see
mi register
and mi passive
for details.

Following the guidelines in
program properties,
we need to create an e-class program that stores the AUC estimate and
its standard error in **e(b)** and **e(V)**. We also need to store
other results, such as the name of the command in **e(cmd)**, the number
of observations in **e(N)**, and the title of the command in
**e(title)**.

In the following code, the e-class program **eroctab** is defined.
**eroctab** calls **roctab** and posts necessary results to
**e()**; see ereturn
for details. This program accepts two arguments: the name of the reference
variable (stored in the local macro **refvar**) and the name of the
classification variable (stored in the local macro **classvar**). In step
1, we perform ROC analysis of variables in **refvar** and **classvar**
using **roctab**. In step 2, we save the estimates of the area and its
variance in temporary matrices **b** and **V**, respectively. Then
in step 3, we label columns of coefficient matrix **`b'** and rows and
columns of the VCE matrix **`V'** consistently, as required by **ereturn
post**. Finally, in step 4, we post coefficient and VCE matrices as well
as other results to **e()**.

cap program drop eroctab program eroctab, eclass version 12.0 /* Step 1: perform ROC analysis */ args refvar classvar roctab `refvar' `classvar' /* Step 2: save estimate and its variance in temporary matrices*/ tempname b V mat `b' = r(area) mat `V' = r(se)^2 local N = r(N) /* Step 3: make column names and row names consistent*/ mat colnames `b' = AUC mat colnames `V' = AUC mat rownames `V' = AUC /*Step 4: post results to e()*/ ereturn post `b' `V', obs(`N') ereturn local cmd "eroctab" ereturn local title "ROC area" end

Inference about the AUC estimate in **roctab** is based on the
large-sample normal approximation. Sometimes commands, for example **regress**,
adjust for small samples and use Student’s *t* distribution for inference. In
such cases, the corresponding (residual or denominator) degrees of freedom
must be posted to **e(df_r)**. This can be done by specifying the
**dof()** option with **ereturn post**.

Now we can use **eroctab** with the **mi estimate:** prefix to obtain
multiple-imputation estimates of the ROC area:

.mi estimate, cmdok: eroctab attack high_bmi

Multiple-imputation estimates Imputations = 20 |

ROC area Number of obs = 154 |

Average RVI = 0.0753 |

Largest FMI = 0.0705 |

DF adjustment: Large sample DF: min = 3,873.11 |

avg = 3,873.11 |

max = 3,873.11 |

Coefficient Std. err. t P>|t| [95% conf. interval] | ||

AUC | .5450938 .0312929 17.42 0.000 .4837416 .6064459 | |

In the command line above, we used the **cmdok** option with **mi estimate** because
**eroctab** is not one of the
estimation
commands officially supported by **mi estimate**. Alternatively, we
could have included the **mi** property in the definition of the program
**eroctab** to use it with **mi estimate** directly:

cap program drop eroctab program eroctab, eclass properties(mi) ... end .mi estimate: eroctab attack high_bmi

Another situation when you need to define your own program is when your results are not obtained via a single command, but by using a sequence of commands such as, for example, with suest.

Below we demonstrate an example in which we want to compare the effects of smoking on heart attacks from two logistic models: one adjusting for both age and BMI and the other adjusting only for BMI.

First, we create a new program, **mysuest**, which fits two models of
interest and then uses **suest** to combine the two estimation results.
The models of interest are passed as two arguments and are stored in local
macros **model1** and **model2**. To avoid specifying the
**cmdok** option with **mi estimate**, we add **properties(mi)**
to the definition of the **mysuest** program.

cap program drop mysuest program mysuest, eclass properties(mi) version 12.0 args model1 model2 qui `model1' estimates store est1 qui `model2' estimates store est2 suest est1 est2 estimates drop est1 est2 ereturn local title "Seemingly unrelated estimation" end

Because **suest** is an estimation command and already stores the
appropriate results in **e()**, we did not need to manually post (or repost)
anything to **e()** except **e(title)**, which is not stored
by **suest**.

We can now use **mysuest** with **mi estimate**:

.webuse mheart1s20, clear(Fictional heart attack data; BMI missing) .mi estimate: mysuest "logit attack smokes age bmi" "logit attack smokes bmi"Multiple-imputation estimates Imputations = 20 Seemingly unrelated estimation Number of obs = 154 Average RVI = 0.0541 Largest FMI = 0.1568 DF adjustment: Large sample DF: min = 794.17 avg = 234,946.42 Within VCE type: Robust max = 667,104.68

Coefficient Std. err. t P>|t| [95% conf. interval] | ||

est1_attack | ||

smokes | 1.187115 .3574611 3.32 0.001 .486502 1.887728 | |

age | .0352896 .0162749 2.17 0.030 .0033913 .067188 | |

bmi | .102868 .0478936 2.15 0.032 .0089024 .1968336 | |

_cons | -5.312597 1.69795 -3.13 0.002 -8.641634 -1.983561 | |

est2_attack | ||

smokes | 1.172281 .3496687 3.35 0.001 .4869414 1.85762 | |

bmi | .0921183 .0470325 1.96 0.051 -.0002044 .1844411 | |

_cons | -3.038422 1.229553 -2.47 0.014 -5.451692 -.6251518 | |

Recall that within multiple-imputation framework, to test the equality of
coefficients, we must first estimate their difference and then use
mi
testtransform to test the hypothesis; see [MI] **mi test** for details.

So, to test the coefficients on **smokes** from the two logistic models,
we first estimate their difference with **mi estimate**. To display only
transformed results, we specify the **nocoef** option:

.mi estimate (diff: [est1_attack]smokes - [est2_attack]smokes), nocoef: > mysuest "logit attack smokes age bmi" "logit attack smokes bmi"Multiple-imputation estimates Imputations = 20 Seemingly unrelated estimation Number of obs = 154 Average RVI = 0.0047 Largest FMI = 0.0047 DF adjustment: Large sample DF: min = 866,779.11 avg = 866,779.11 Within VCE type: Robust max = 866,779.11 diff: [est1_attack]smokes - [est2_attack]smokes

Coefficient Std. err. t P>|t| [95% conf. interval] | ||

diff | .0148342 .079597 0.19 0.852 -.1411732 .1708417 | |

Although the *t* test, which is automatically reported by **mi estimate**, is
sufficient for testing the hypothesis of no difference between the two
coefficient estimates of **smokes**, we can also use the following **mi
testtransform** command to test this hypothesis:

.mi testtransform diffnote: assuming equal fractions of missing information. diff: [est1_attack]smokes - [est2_attack]smokes ( 1) diff = 0 F( 1,866779.1) = 0.03 Prob > F = 0.8522

We do not have sufficient evidence to reject the null hypothesis of the equality of the smoking effects from the considered logistic models.

We could easily extend our **mysuest** program to allow more than two
models to be combined by **suest**.