RE: st: Stata 11 imputation

From   "Lachenbruch, Peter" <>
To   <>
Subject   RE: st: Stata 11 imputation
Date   Tue, 28 Jul 2009 10:30:12 -0700

Thanks for this thoughtful reply.  My problem is a little different.  In
my problem, I have some continuous (maybe 'normal') variables, some
dichotomous variables, and some categorical variables.  It looks like mi
impute will allow me to impute the normal variables and all others, but
when I want to impute the categorical variables it looks as if I will
re-impute the normal ones as categories.  I will likely need to continue
to use ICE.

BTW, I've just finished a study on variable selection with missing
values.  I imputed using ICE and then did a stepwise procedure.  It
worked very well and no matter which selection method I used, almost the
same variables were selected.  I used lars, stepwise regression,
stepwise ordered logistic regression.

Manuscript is in final revision process, so not available for


Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
[] On Behalf Of Yulia
Marchenko, StataCorp LP
Sent: Monday, July 27, 2009 2:32 PM
Subject: Re: st: Stata 11 imputation

Fred Wolfe <> asks about imputing 
multiple categorical variables using -mi impute mvn- available 
as of Stata 11:

> I wonder if it might be possible in a revision of the manual to
> actually describe how to impute categorical values without having to
> purchase Allison's book (available on at a reasonable
> cost). There are a lot of "simple" examples in the manual. but no
> complex examples - somethings that would be helpful.

Before I answer Fred's specific questions, let me note that imputing
categorical variables is a difficult task in general.  Currently, there
is no definitive recommendation in the literature as to what imputation
method should be used to perform this task.

Multivariate normal imputation is not designed for imputing multiple
categorical variables.  However, Allison (2001, 40) suggests an ad hoc
way in which this can be done.  One can use a dummy representation of
categorical variables to impute the corresponding indicator variables.
For example, if a variable contains three categories, one will impute
two indicator variables corresponding to two categories, and then will
compute the third variable, corresponding to the reference category, as
one minus the sum of the two imputed indicator variables.  The imputed
indicator variables will take values on a continuous scale.  To convert
them to the binary metric, you assign 1 to the indicator variable with
the largest value and 0 to the remaining indicator variables.  More
simulation is needed to evaluate the performance of this method in
practice.
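
As a rough illustration of the rounding step just described (not code
from Allison or from Stata), here is a minimal Python sketch.  The
function name round_indicators and the convention of putting the
reference category last are assumptions made for this example only.

```python
def round_indicators(imputed):
    """Convert continuous imputed indicators to a 0/1 coding.

    `imputed` holds the imputed values of the k-1 non-reference
    indicator variables for one observation (hypothetical input layout).
    Returns k zeros/ones; the reference category is the last element.
    """
    # The reference-category indicator is one minus the sum of the others.
    reference = 1.0 - sum(imputed)
    values = list(imputed) + [reference]
    # The indicator with the largest value gets 1; all others get 0.
    # Ties go to the first of the tied indicators.
    winner = max(range(len(values)), key=values.__getitem__)
    return [1 if i == winner else 0 for i in range(len(values))]


# Three categories: two imputed indicators plus the derived reference.
print(round_indicators([0.2, 0.7]))   # second category wins
print(round_indicators([-0.3, 0.4]))  # reference category wins (value 0.9)
```

Note that imputed values can fall outside [0, 1], as in the second call;
the rule only compares the values with one another.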

Allison (2001) also notes that the analysis using imputed values without
rounding is superior to that which uses rounded imputed values (as
above).  Our simulations displayed similar behavior in the case of

However, if a binary or categorical _dependent_ variable is being
using a regression-based method, rounding is unavoidable.

> Would it be possible for StataCorp people to indicate on the list the
> advantages of their multivariate method compared with Royston's.

-mi impute mvn- implements a method for imputing multivariate continuous
data based on Schafer (1997), which is an extension of the theoretical
work by Li (1988).  This method is commonly referred to as NORM.  NORM
assumes a multivariate normal distribution and uses data augmentation
(an iterative MCMC procedure) to simulate a predictive distribution from
which imputed values are drawn.
Patrick Royston's -ice- command implements imputation via chained
equations (ICE).  ICE uses Gibbs sampling, another MCMC procedure, to
obtain imputed values.  ICE, however, does not assume a joint
multivariate model.  Instead, it uses a set of univariate full
conditional specifications.  In general, these do not always lead to a
proper multivariate distribution.
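
To make the idea of cycling through univariate conditional models
concrete, here is a toy Python sketch for two continuous variables.  It
is not Royston's -ice- implementation: the names fit_line and
chained_impute are hypothetical, only linear models are used, and the
regression parameters are fixed at their least-squares estimates rather
than drawn from a posterior, so this shows the cycling scheme only and
is not a proper multiple-imputation procedure.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd


def update(target, missing, predictor, rng):
    """Re-impute the missing entries of `target` from `predictor`."""
    # Fit the conditional model on rows where the target was observed.
    obs = [i for i, m in enumerate(missing) if not m]
    c, s, sd = fit_line([predictor[i] for i in obs], [target[i] for i in obs])
    for i, m in enumerate(missing):
        if m:
            # Predicted value plus random noise from the residual spread.
            target[i] = c + s * predictor[i] + rng.gauss(0, sd)


def chained_impute(a, b, cycles=10, seed=0):
    """Fill None entries of a and b by alternating a|b and b|a models."""
    rng = random.Random(seed)
    miss_a = [v is None for v in a]
    miss_b = [v is None for v in b]
    # Starting values: observed means.
    mean_a = statistics.fmean(v for v in a if v is not None)
    mean_b = statistics.fmean(v for v in b if v is not None)
    a = [mean_a if m else v for v, m in zip(a, miss_a)]
    b = [mean_b if m else v for v, m in zip(b, miss_b)]
    for _ in range(cycles):
        update(a, miss_a, b, rng)  # conditional model for a given b
        update(b, miss_b, a, rng)  # conditional model for b given a
    return a, b


a = [1.0, 2.0, None, 4.0, 5.0, 6.0]
b = [2.1, None, 6.2, 8.1, 9.9, 12.2]
print(chained_impute(a, b))
```

Each variable has its own conditional model, so a binary or categorical
variable could in principle be given a logistic or multinomial model
instead of a linear one; that flexibility is what the chained approach
buys, at the cost of not starting from a joint model.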

The main advantage of NORM is a theoretical one -- the convergence of
the method to a proper posterior distribution is theoretically
justified.  Theoretical justification for the chained-equation approach
in general is not as well developed in the literature, mainly because
the chained-equation approach is not always supported by a proper
underlying multivariate model; see, for example, van Buuren (2007).

The main advantage of ICE is that it is more flexible than NORM and can
directly handle non-continuous data.  However, as mentioned above,
convergence to a proper multivariate distribution can be an issue.

Under the assumption of normality, ICE corresponds to a pure Gibbs
sampling procedure and is equivalent to NORM.  The two procedures
performed similarly in our simulation.  More simulation is needed,
however, to compare the two methods for imputing binary or categorical
data.
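
One way to see why the two approaches coincide under normality is the
standard conditional distribution of a bivariate normal, written here
with notation assumed for this note (means mu_1, mu_2; variances
sigma_11, sigma_22; covariance sigma_12):

```latex
X_1 \mid X_2 = x_2 \;\sim\;
N\!\left( \mu_1 + \frac{\sigma_{12}}{\sigma_{22}}\,(x_2 - \mu_2),\;
          \sigma_{11} - \frac{\sigma_{12}^{2}}{\sigma_{22}} \right)
```

Each full conditional is a linear regression with normal errors, so
cycling through linear-regression conditionals amounts to Gibbs sampling
from a joint multivariate normal model.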


Allison, P. D. 2001. Missing Data. Thousand Oaks, CA: Sage.

Li, K.-H. 1988. Imputation using Markov chains. Journal of Statistical 
Computation and Simulation 30: 57--79.

Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Boca
Raton, FL: Chapman & Hall/CRC.

van Buuren, S. 2007. Multiple imputation of discrete and continuous data
by fully conditional specification. Statistical Methods in Medical Research

-- Yulia
*   For searches and help try: