Multiple imputation for missing data

Stata’s mi command provides a full suite of multiple-imputation methods for the analysis of incomplete data, that is, data in which some values are missing. mi handles both the imputation step and the estimation step, and the estimation step covers both estimation on the individual imputed datasets and pooling of the results in one easy-to-use procedure. Tools are provided for examining the pattern of missing values in the data. Flexible imputation methods are also provided: nine univariate methods, which can serve as building blocks for multivariate imputation using chained equations, and multivariate normal (MVN) imputation.

mi provides easy importing of already imputed data and full imputed-data management capabilities.

Multiple imputation—estimation

We want to study the linear relationship between y and predictors x1 and x2. Our data contain missing values, however, and standard casewise deletion would reduce the sample size by roughly 40%! We will fit the model using multiple imputation (MI) instead.
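To see why casewise deletion is costly, it can help to tabulate which variables are missing together. The following is a minimal Python sketch of that idea, not Stata code; the toy records and the helper name `missing_patterns` are made up for illustration, and `None` stands in for Stata's soft missing value (`.`).

```python
from collections import Counter

def missing_patterns(rows, variables):
    """Count rows by which of the given variables are missing (None)."""
    return Counter(
        tuple(v for v in variables if row[v] is None) for row in rows
    )

# Hypothetical toy data; None stands in for a missing value.
rows = [
    {"y": 1.2, "x1": 0.5, "x2": 1.0},
    {"y": 0.7, "x1": None, "x2": 0.3},
    {"y": 2.1, "x1": None, "x2": None},
    {"y": 1.9, "x1": 0.8, "x2": None},
    {"y": 0.4, "x1": 0.1, "x2": 0.9},
]

patterns = missing_patterns(rows, ["y", "x1", "x2"])
complete = patterns[()]           # rows with nothing missing
lost = 1 - complete / len(rows)   # share of rows casewise deletion would drop
print(patterns)
print(f"casewise deletion keeps {complete} of {len(rows)} rows ({lost:.0%} lost)")
```

A row is dropped by casewise deletion as soon as any one model variable is missing, so the loss depends on how the missing values overlap across variables, not only on the per-variable counts.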

First, we impute missing values and arbitrarily create five imputation datasets:

. mi impute mvn y x1 x2, add(5)
note: variable y contains no soft missing (.) values; imputing nothing

Performing EM optimization:
  observed log likelihood = -59.441984 at iteration 15

Performing MCMC data augmentation ...

Multivariate imputation                     Imputations =        5
Multivariate normal regression                    added =        5
Imputed: m=1 through m=5                        updated =        0

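Prior: uniform                               Iterations =      500
                                                burn-in =      100
                                                between =      100

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
                 y |         50            0         0 |        50
                x1 |         35           15        15 |        50
                x2 |         46            4         4 |        50
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
 of the number of filled-in observations.)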

That done, we can fit the model:

. mi estimate: regress y x1 x2

Multiple-imputation estimates                   Imputations       =          5
Linear regression                               Number of obs     =         50
                                                Average RVI       =     0.2488
                                                Largest FMI       =     0.2995
                                                Complete DF       =         47
DF adjustment:   Small sample                   DF:     min       =      20.88
                                                        avg       =      27.58
                                                        max       =      35.41
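Model F test:       Equal FMI                   F(   2,   25.5)   =      11.90
Within VCE type:          OLS                   Prob > F          =     0.0002

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x1 |   .4079375    .172301     2.37   0.028     .0494925    .7663824
          x2 |   .7211742   .1855085     3.89   0.000     .3447275    1.097621
       _cons |  -.1526739   .1709024    -0.89   0.380    -.5036782    .1983304
------------------------------------------------------------------------------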

mi estimate fits the specified model (linear regression here) on each of the imputation datasets (five here) and then combines the results into one MI inference.
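The combining step can be illustrated with Rubin's rules, the standard way to pool completed-data results. The Python sketch below is an illustration of those rules for a single coefficient, not Stata's implementation; the function name `pool_rubin` and the five coefficient and standard-error values are hypothetical and are not taken from the output above.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates of one coefficient by Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance
    return q_bar, t ** 0.5

# Hypothetical results from m = 5 completed-data regressions.
coefs = [0.40, 0.42, 0.38, 0.41, 0.39]
ses = [0.17, 0.17, 0.17, 0.17, 0.17]
est, se = pool_rubin(coefs, [s ** 2 for s in ses])
print(f"pooled coefficient = {est:.4f}, pooled std. err. = {se:.4f}")
```

The pooled standard error is larger than any single completed-data standard error here because the between-imputation term carries the extra uncertainty due to the missing values.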

Multiple imputation—nuts and bolts

mi can import already imputed data from NHANES or ice, or you can start with original data and form imputations yourself.

Either way, dealing with the multiple copies of the data is the bane of MI analysis. mi solves that problem. mi organizes the data in one of four formats, called wide, mlong, flong, and flongsep. In flongsep format, each imputation dataset is its own file. In the other formats, the data are combined into one dataset. Each format has its advantages, and mi makes it easy to switch formats: you can type or click one command to convert your data from one format to another. You can start with the data organized one way and continue with them organized another way, so you always work with the most convenient organization.
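One practical difference between the single-file formats is how many rows they store. The Python sketch below illustrates this under the usual description of the formats (wide keeps imputed values in extra variables, flong repeats every observation for each imputation, and mlong repeats only the incomplete observations). The helper `mi_rows` is hypothetical, and the count of 19 incomplete observations is an assumption: the imputation table above only shows that the number lies between 15 and 19.

```python
def mi_rows(style, n_obs, n_incomplete, m):
    """Rows stored in the single dataset for a given mi format (sketch)."""
    if style == "wide":
        return n_obs                      # imputations live in extra variables
    if style == "mlong":
        return n_obs + m * n_incomplete   # only incomplete rows are repeated
    if style == "flong":
        return n_obs * (m + 1)            # every row repeated for each imputation
    raise ValueError(f"unsupported style: {style}")

# 50 observations and 5 imputations as in the example; 19 incomplete is assumed.
for style in ("wide", "mlong", "flong"):
    print(style, mi_rows(style, n_obs=50, n_incomplete=19, m=5))
```

flongsep is left out because it spreads the same information over separate files, one for the original data and one per imputation.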

All mi commands work with all data formats.

Full data management is provided, too. You can create variables, drop variables, or create and drop observations as if you were working with one dataset, leaving it to mi to duplicate the changes correctly over each of the imputation datasets. You can merge your MI data with other datasets, both regular and MI, or append them, or copy the imputed values from one dataset to another. If you are analyzing survival data, you can split or join time periods just as you would ordinarily. The same applies if you are working with panel data and want to reshape your data. The fact that the actions you take might need to be carried out consistently over 5, 50, or even 500 datasets is irrelevant.

Multiple Imputation—Control Panel

mi’s Control Panel will guide you through all the phases of MI.

The Control Panel unifies many of mi’s capabilities into one flexible user interface. It guides you from the very beginning of your MI working session—examining missing values and their patterns—to the very end of it—performing MI inference.

Use the Examine tools to check missing-value patterns and to determine the appropriate imputation method.

Move on to Setup to set up your data for use by mi.

Need to create imputations? Use Impute.

Already have imputations? Skip Setup and go directly to Import to import your already imputed data.

To create new variables, merge or reshape your data, or use other data-management commands with mi data, go to Manage.

When you are ready, use Estimate to choose a model for your analysis. A set of dialog tabs will help you easily build your MI estimation model.

The Test and Predict panels let you finish your analysis by performing tests of hypotheses and computing MI predictions.

Multiple imputation—capabilities

Imputation

Impute missing values of a single variable using one of nine univariate methods:

linear regression (fully parametric) for continuous variables
predictive mean matching (semiparametric) for continuous variables
truncated regression for continuous variables with a restricted range
interval regression for censored continuous variables
logistic for binary variables
ordered logistic for ordinal variables
multinomial (polytomous) logistic for nominal variables
Poisson for count variables
negative binomial for overdispersed count variables

Impute missing values of multiple variables of different types with an arbitrary missing-value pattern using chained equations:

Use any of the nine methods above to build a flexible imputation model. (For example, you could impute x1 and x2 jointly using predictive mean matching for x1 and ordered logistic for x2.)
Customize prediction equations for imputed variables (such as omitting z2 from the model for x1).
Impute missing values using different observations for different variables (such as imputing missing values of number of cigarettes smoked per day using only current smokers while using all observations to impute weight). You can do this even when the smoking status is missing for some observations and is being imputed itself.
Allow general expressions of imputed variables in the equations for later imputed variables (for example, impute x1 and include x1² in x2’s imputation model).

Impute missing values of multiple continuous variables with an arbitrary missing-value pattern using an MVN model, allowing full or conditional model specification. Three prior specifications are provided.
Update missing values even after you have already imputed some of them, including increasing the number of imputed datasets.
Impute missing values using weighted and survey-weighted data with all the above techniques except MVN.
Perform conditional imputation with all the above techniques except MVN (for example, restrict imputation of number of pregnancies to females even when female itself contains missing values and so is being imputed).
Impute missing values separately for different groups of the data.

Estimation

In one simple step, perform both individual estimations and pooling of results.
Fit models with most Stata estimation commands, including survival-data regression models, survey-data regression models, and panel and multilevel regression models.
Obtain MI estimates of transformed parameters.
Obtain MI estimates from previously saved individual estimation results.
Obtain detailed information about MI characteristics, including relative efficiency, simulation error, and fraction of missing information due to nonresponse.
Estimate the amount of simulation error in your final model, so you can decide whether you need more imputations.
Estimate with community-contributed estimators.
mi verifies the integrity of the estimation model across imputations (consistency of estimation samples and omitted variables, model convergence) and notifies you if a problem exists.
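The MI characteristics mentioned in this list can be computed from the within- and between-imputation variances of a coefficient. The Python sketch below uses the textbook large-sample formulas associated with Rubin's rules; it omits the small-sample degrees-of-freedom adjustment that the "DF adjustment: Small sample" line in the regression output above refers to, so it is not expected to reproduce Stata's numbers. The function name `mi_diagnostics` and the variance inputs are hypothetical; only the FMI of 0.2995 and the 5 imputations in the last lines come from the output above.

```python
def mi_diagnostics(w, b, m):
    """Large-sample RVI, degrees of freedom, FMI and relative efficiency.

    w: within-imputation variance, b: between-imputation variance (b > 0),
    m: number of imputations.
    """
    rvi = (1 + 1 / m) * b / w                 # relative variance increase
    df = (m - 1) * (1 + 1 / rvi) ** 2         # large-sample degrees of freedom
    fmi = (rvi + 2 / (df + 3)) / (rvi + 1)    # fraction of missing information
    re = 1 / (1 + fmi / m)                    # efficiency relative to infinitely many imputations
    return rvi, df, fmi, re

# Hypothetical variances for one coefficient with m = 5.
rvi, df, fmi, re = mi_diagnostics(w=0.03, b=0.00025, m=5)
print(f"RVI = {rvi:.4f}, df = {df:.0f}, FMI = {fmi:.4f}, RE = {re:.4f}")

# Relative efficiency implied by the largest FMI reported above (0.2995, m = 5).
re_reported = 1 / (1 + 0.2995 / 5)
print(f"relative efficiency at the largest FMI: {re_reported:.4f}")
```

A relative efficiency close to 1 suggests that adding imputations would gain little, whereas a large FMI combined with few imputations is a signal to consider increasing their number.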

Postestimation

Perform tests on multiple coefficients simultaneously.
Tests available under the assumptions of equal and unequal fractions of missing information.
Small-sample adjustments.
Compute linear and nonlinear predictions after MI estimation.

