
From: ymarchenko@stata.com (Yulia Marchenko, StataCorp LP)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Stata 11 imputation
Date: Mon, 27 Jul 2009 16:32:29 -0500

Fred Wolfe <fwolfe@arthritis-research.org> asks about imputing multiple categorical variables using -mi impute mvn-, available as of Stata 11:

> I wonder if it might be possible in a revision of the manual to
> actually describe how to impute categorical values without having to
> purchase Allison's book (available on Amazon.com at a reasonable
> cost). There are a lot of "simple" examples in the manual. but no
> complex examples - somethings that would be helpful.

Before I answer Fred's specific questions, let me note that imputing multiple categorical variables is a difficult task in general. Currently, there is no definitive recommendation in the literature as to which imputation method should be used for this task.

Multivariate normal imputation is not designed for imputing multiple categorical variables. However, Allison (2001, 40) suggests an ad hoc way to do it. One can use a dummy representation of the categorical variables and impute the corresponding indicator variables. For example, if a variable contains three categories, one imputes two indicator variables, corresponding to two of the categories, and then computes the third indicator variable, corresponding to the reference category, as one minus the sum of the two imputed indicator variables. The imputed indicator variables will contain values on a continuous scale. To convert them to the binary metric, you assign 1 to the indicator variable with the largest value and 0 to the other indicator variables. More simulation is needed to evaluate the performance of this method in practice.

Allison (2001) also notes that an analysis using imputed values without rounding is superior to one that uses rounded imputed values (as described above). Our simulations displayed similar behavior in the case of binary predictors. However, if a binary or categorical _dependent_ variable is being imputed using a regression-based method, rounding is unavoidable.
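To make the rounding step described above concrete, here is a minimal sketch in Python (not Stata code, and not something -mi impute mvn- does for you). It takes the continuous imputed values of the k-1 non-reference indicators for one observation, derives the reference indicator as one minus their sum, and assigns 1 to the largest and 0 to the rest. The function name round_imputed_indicators is hypothetical, and breaking ties in favor of the first category is my assumption; the description above does not say how ties should be handled.

```python
from typing import List, Sequence


def round_imputed_indicators(imputed: Sequence[float]) -> List[int]:
    """Convert continuous imputed indicators to 0/1 indicators.

    `imputed` holds the imputed values of the k-1 non-reference
    indicator variables for a single observation. The result has k
    entries; the reference category is placed last.
    """
    # The reference indicator is one minus the sum of the imputed ones.
    reference = 1.0 - sum(imputed)
    values = list(imputed) + [reference]

    # Index of the largest value; on ties, max() keeps the first index
    # (an assumption, since the source does not specify tie-breaking).
    winner = max(range(len(values)), key=values.__getitem__)

    # Assign 1 to the winning category and 0 to all others.
    return [1 if i == winner else 0 for i in range(len(values))]


if __name__ == "__main__":
    # Three categories: two imputed indicators plus the reference.
    print(round_imputed_indicators([0.7, 0.2]))   # first category wins
    print(round_imputed_indicators([-0.3, 0.6]))  # reference (0.7) wins
```

Note that imputed values can fall outside [0, 1], as in the second call; the rule still picks exactly one category per observation.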
> Would it be possible for StataCorp people to indicate on the list the
> advantages of their multivariate method compared with Royston's.

-mi impute mvn- implements a method for imputing multivariate continuous data based on Schafer (1997), which is an extension of the theoretical work by Li (1988). This method is commonly referred to as NORM. NORM assumes a joint multivariate normal distribution and uses data augmentation (an iterative MCMC procedure) to simulate a predictive distribution from which imputed values are drawn.

Patrick Royston's -ice- command implements imputation via chained equations (ICE). ICE uses Gibbs sampling, another MCMC procedure, to obtain imputed values. ICE, however, does not assume a joint multivariate model. Instead, it uses a set of univariate full conditional specifications. These do not always lead to a proper multivariate distribution.

The main advantage of NORM is a theoretical one: the convergence of the method to a proper posterior distribution is theoretically justified. Theoretical justification for the chained-equation approach in general is not as well developed in the literature, mainly because the approach is not always supported by a proper underlying multivariate model; see, for example, van Buuren (2007).

The main advantage of ICE is that it is more flexible than NORM and can handle noncontinuous data more directly. However, as mentioned above, convergence to a proper multivariate distribution can be an issue.

Under the assumption of normality, ICE corresponds to a pure Gibbs sampling procedure and is equivalent to NORM. The two procedures performed comparably in our simulation. More simulation is needed, however, to compare the two methods for imputing binary or categorical data.

References:

Allison, P. D. 2001. Missing Data. Thousand Oaks, CA: Sage.

Li, K.-H. 1988. Imputation using Markov chains. Journal of Statistical Computation and Simulation 30: 57-79.

Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Boca Raton, FL: Chapman & Hall/CRC.

van Buuren, S. 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research 16: 219-242.

-- Yulia
ymarchenko@stata.com

