
From: Fred Wolfe <fwolfe@arthritis-research.org>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Stata 11 imputation
Date: Tue, 28 Jul 2009 06:21:11 -0500

Thank you very much for your thorough reply. I do hope that Stata
continues to actively support ICE. Stata will be a richer package if
both current major methods are available and supported.

Fred

On Mon, Jul 27, 2009 at 4:32 PM, Yulia Marchenko, StataCorp LP
<ymarchenko@stata.com> wrote:
> Fred Wolfe <fwolfe@arthritis-research.org> asks about imputing
> multiple categorical variables using -mi impute mvn- available
> as of Stata 11:
>
>> I wonder if it might be possible in a revision of the manual to
>> actually describe how to impute categorical values without having to
>> purchase Allison's book (available on Amazon.com at a reasonable
>> cost). There are a lot of "simple" examples in the manual, but no
>> complex examples, something that would be helpful.
>
> Before I answer Fred's specific questions, let me note that imputing multiple
> categorical variables is a difficult task in general. Currently, there is no
> definitive recommendation in the literature as to which imputation method
> should be used to perform this task.
>
> Multivariate normal imputation is not designed for imputing multiple
> categorical variables. However, Allison (2001, 40) suggests an ad hoc way
> to do this. One can use a dummy representation of categorical
> variables to impute the corresponding indicator variables. For example, if a
> variable contains three categories, one will impute two indicator variables,
> corresponding to two categories, and then will compute the third indicator
> variable, corresponding to the reference category, as one minus the sum of the
> two imputed indicator variables. The imputed indicator variables will contain
> values on a continuous scale. To convert them to the binary metric, you
> assign 1 to the indicator variable with the largest value and 0 to the other
> indicator variables. More simulation is needed to evaluate the performance of
> this method in practice.
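
The rounding rule described above can be sketched in a few lines. This is only an illustration of the ad hoc conversion step, not of the imputation itself and not of anything -mi impute mvn- does internally; the function name round_imputed_indicators and the input values are made up for the example.

```python
import numpy as np

def round_imputed_indicators(imputed):
    """Convert imputed indicator values (continuous scale) to 0/1 dummies.

    imputed: array of shape (n, k-1), one column per non-reference category.
    Returns an (n, k) integer array; the last column is the reference category.
    """
    imputed = np.asarray(imputed, dtype=float)
    # Reference-category indicator = one minus the sum of the imputed indicators.
    reference = 1.0 - imputed.sum(axis=1, keepdims=True)
    full = np.hstack([imputed, reference])
    # Assign 1 to the indicator with the largest value and 0 to the others.
    winners = full.argmax(axis=1)
    dummies = np.zeros(full.shape, dtype=int)
    dummies[np.arange(full.shape[0]), winners] = 1
    return dummies

# Three observations of a three-category variable (two imputed indicators each).
example = [[0.7, 0.2], [0.1, 0.3], [-0.2, 0.5]]
print(round_imputed_indicators(example))
# [[1 0 0]
#  [0 0 1]
#  [0 0 1]]
```

Note that imputed values outside [0, 1], such as -0.2 in the third row, are possible because the indicators are imputed under a normal model; the rule still picks a single category.
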
>
> Allison (2001) also notes that the analysis using imputed values without
> rounding is superior to that which uses rounded imputed values (as described
> above). Our simulations displayed similar behavior in the case of binary
> predictors.
>
> However, if a binary or categorical _dependent_ variable is being imputed
> using a regression-based method, rounding is unavoidable.
>
>> Would it be possible for StataCorp people to indicate on the list the
>> advantages of their multivariate method compared with Royston's?
>
> -mi impute mvn- implements a method for imputing multivariate continuous data
> based on Schafer (1997), which is an extension of the theoretical work by Li
> (1988). This method is commonly referred to as NORM. NORM assumes a joint
> multivariate normal distribution and uses data augmentation (an iterative MCMC
> procedure) to simulate a predictive distribution from which imputed values are
> drawn.
>
> Patrick Royston's -ice- command implements imputation via chained equations
> (ICE). ICE uses Gibbs sampling, another MCMC procedure, to obtain imputed
> values. ICE, however, does not assume a joint multivariate model. Instead,
> it uses a set of univariate full conditional specifications. These do not
> always correspond to a proper joint multivariate distribution.
>
> The main advantage of NORM is a theoretical one: the convergence of the
> method to a proper posterior distribution is theoretically justified.
> Theoretical justification for the chained-equation approach in general is not
> as well developed in the literature, mainly because the chained-equation
> approach is not always supported by a proper underlying multivariate model;
> see, for example, van Buuren (2007).
>
> The main advantage of ICE is that it is more flexible than NORM and can more
> directly handle non-continuous data. However, as mentioned above, convergence
> to a proper multivariate distribution can be an issue.
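
To make "a set of univariate full conditional specifications" concrete, here is a rough sketch of the chained-equation idea for continuous variables, with each variable regressed in turn on all the others. It is not Royston's -ice- implementation, and it is deliberately simplified: the regression coefficients and residual variance are plugged in as estimated rather than drawn from their posterior, so it would not be a proper multiple-imputation procedure as written. The function name chained_normal_imputation and its parameters are assumptions made for the example.

```python
import numpy as np

def chained_normal_imputation(data, n_cycles=10, seed=0):
    """Fill np.nan entries by cycling through per-variable linear regressions.

    data: (n, p) float array with np.nan marking missing values.
    Returns a completed copy; observed values are left untouched.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = data.copy()
    # Starting values: replace each missing entry with its column's observed mean.
    col_means = np.nanmean(data, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    n, p = filled.shape
    for _ in range(n_cycles):
        for j in range(p):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            # Univariate conditional model: variable j given all other variables.
            others = np.delete(filled, j, axis=1)
            X = np.column_stack([np.ones(n), others])
            X_obs, y_obs = X[~miss_j], filled[~miss_j, j]
            beta, *_ = np.linalg.lstsq(X_obs, y_obs, rcond=None)
            resid = y_obs - X_obs @ beta
            dof = max(len(y_obs) - X.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Simplification: draw from the fitted conditional normal only.
            draws = X[miss_j] @ beta + rng.normal(0.0, sigma, miss_j.sum())
            filled[miss_j, j] = draws
    return filled
```

Swapping the linear regression for, say, a logistic model for a binary variable is what gives the chained-equation approach its flexibility, and it is also the point at which the set of conditionals may stop corresponding to any joint distribution.
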
>
> Under the assumption of normality, ICE corresponds to a pure Gibbs sampling
> procedure and is equivalent to NORM. The two procedures performed comparably
> in our simulation. More simulation is needed, however, to compare the two
> methods for imputing binary or categorical data.
>
>
> References:
>
> Allison, P. D. 2001. Missing Data. Thousand Oaks, CA: Sage.
>
> Li, K.-H. 1988. Imputation using Markov chains. Journal of Statistical
> Computation and Simulation 30: 57--79.
>
> Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Boca Raton,
> FL: Chapman & Hall/CRC.
>
> van Buuren, S. 2007. Multiple imputation of discrete and continuous data by
> fully conditional specification. Statistical Methods in Medical Research 16:
> 219--242.
>
>
> -- Yulia
> ymarchenko@stata.com
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>

--
Fred Wolfe
National Data Bank for Rheumatic Diseases
Wichita, Kansas
NDB Office +1 316 263 2125 Ext 0
Research Office +1 316 686 9195
fwolfe@arthritis-research.org
