# Re: st: Stata 11 imputation

| From | ymarchenko@stata.com (Yulia Marchenko, StataCorp LP) |
|---|---|
| To | statalist@hsphsun2.harvard.edu |
| Subject | Re: st: Stata 11 imputation |
| Date | Mon, 27 Jul 2009 16:32:29 -0500 |

```
Fred Wolfe <fwolfe@arthritis-research.org> asks about imputing
multiple categorical variables using -mi impute mvn- available
as of Stata 11:

> I wonder if it might be possible in a revision of the manual to
> actually describe how to impute categorical values without having to
> purchase Allison's book (available on Amazon.com at a reasonable
> cost). There are a lot of "simple" examples in the manual. but no
> complex examples - somethings that would be helpful.

Before I answer Fred's specific questions, let me note that imputing multiple
categorical variables is a difficult task in general.  Currently, there is no
definitive recommendation in the literature as to which imputation method
should be used to perform this task.

Multivariate normal imputation is not designed for imputing multiple
categorical variables.  However, Allison (2001, 40) suggests an ad hoc way to
do this.  One can use a dummy representation of the categorical variables and
impute the corresponding indicator variables.  For example, if a variable
contains three categories, one imputes two indicator variables, corresponding
to two of the categories, and then computes the third indicator variable,
corresponding to the reference category, as one minus the sum of the two
imputed indicator variables.  The imputed indicator variables will contain
values on a continuous scale.  To convert them to the binary metric, assign 1
to the indicator variable with the largest value and 0 to the other indicator
variables.  More simulation is needed to evaluate the performance of this
method in practice.

Allison (2001) also notes that an analysis using the imputed values without
rounding is superior to one that uses rounded imputed values (as described
above).  Our simulations displayed similar behavior in the case of binary
predictors.

However, if a binary or categorical _dependent_ variable is being imputed
using a regression-based method, rounding is unavoidable.

> Would it be possible for StataCorp people to indicate on the list the
> advantages of their multivariate method compared with Royston's.

-mi impute mvn- implements a method for imputing multivariate continuous data
based on Schafer (1997), which is an extension of the theoretical work by Li
(1988).  This method is commonly referred to as NORM.  NORM assumes a joint
multivariate normal distribution and uses data augmentation (an iterative MCMC
procedure) to simulate a predictive distribution from which imputed values are
drawn.

Patrick Royston's -ice- command implements imputation via chained equations
(ICE).  ICE uses Gibbs sampling, another MCMC procedure, to obtain imputed
values.  ICE, however, does not assume a joint multivariate model.  Instead,
it uses a set of univariate full conditional specifications.  In general,
these are not guaranteed to correspond to a proper joint multivariate
distribution.

The main advantage of NORM is a theoretical one -- the convergence of the
method to a proper posterior distribution is theoretically justified.
Theoretical justification for the chained-equation approach in general is not
as well developed in the literature, mainly because the chained-equation
approach is not always supported by a proper underlying multivariate model;
see, for example, van Buuren (2007).

The main advantage of ICE is that it is more flexible than NORM and can more
directly handle non-continuous data.  However, as mentioned above, convergence
to a proper multivariate distribution can be an issue.

Under the assumption of normality, ICE corresponds to a pure Gibbs sampling
procedure and is equivalent to NORM.  The two procedures performed comparably
in our simulation.  More simulation is needed, however, to compare the two
methods for imputing binary or categorical data.

References:

Allison, P. D. 2001. Missing Data. Thousand Oaks, CA: Sage.

Li, K.-H. 1988. Imputation using Markov chains. Journal of Statistical
Computation and Simulation 30: 57--79.

Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Boca Raton,
FL: Chapman & Hall/CRC.

van Buuren, S. 2007. Multiple imputation of discrete and continuous data by
fully conditional specification. Statistical Methods in Medical Research 16:
219--242.

-- Yulia
ymarchenko@stata.com
```
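The rounding rule described in the reply (compute the reference-category indicator as one minus the sum of the imputed indicators, then assign 1 to the indicator with the largest value) can be illustrated with a small sketch. This is not Stata code and not part of -mi impute mvn-; the function name `round_indicators` is hypothetical, and breaking ties in favor of the first category is an assumption the reply does not address.

```python
def round_indicators(imputed):
    """Convert continuous imputed indicators to a binary dummy coding.

    `imputed` holds the k-1 continuous imputed indicator values for one
    observation of a k-category variable. The returned list has k binary
    entries, with the reference category last.
    """
    # Reference-category indicator: one minus the sum of the imputed ones.
    reference = 1.0 - sum(imputed)
    values = list(imputed) + [reference]

    # Assign 1 to the indicator with the largest value, 0 to the rest.
    # Assumption: on ties, the first category with the largest value wins.
    winner = max(range(len(values)), key=values.__getitem__)
    return [1 if i == winner else 0 for i in range(len(values))]


# Three-category variable: two imputed indicators plus the reference.
print(round_indicators([0.2, 0.5]))    # second category is largest
print(round_indicators([0.1, 0.15]))   # reference category is largest
```

As the reply notes, an analysis that uses the unrounded imputed values may perform better than one that uses values rounded this way, so the sketch only shows the mechanics of the ad hoc rule.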