Statalist



Re: st: Stata 11 imputation


From   Fred Wolfe <fwolfe@arthritis-research.org>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Stata 11 imputation
Date   Tue, 28 Jul 2009 06:21:11 -0500

Thank you very much for your thorough reply. I do hope that Stata
continues to actively support ICE. Stata will be a richer package if
both current major methods are available and supported.

Fred

On Mon, Jul 27, 2009 at 4:32 PM, Yulia Marchenko, StataCorp
LP<ymarchenko@stata.com> wrote:
> Fred Wolfe <fwolfe@arthritis-research.org> asks about imputing
> multiple categorical variables using -mi impute mvn- available
> as of Stata 11:
>
>> I wonder if it might be possible in a revision of the manual to
>> actually describe how to impute categorical values without having to
>> purchase Allison's book (available on Amazon.com at a reasonable
>> cost). There are a lot of "simple" examples in the manual, but no
>> complex examples, which would be helpful.
>
> Before I answer Fred's specific questions, let me note that imputing multiple
> categorical variables is a difficult task in general.  Currently, there is no
> definitive recommendation in the literature as to which imputation method
> should be used to perform this task.
>
> Multivariate normal imputation is not designed for imputing multiple
> categorical variables.  However, Allison (2001, 40) suggests an ad hoc way to
> do this.  One can use a dummy representation of categorical
> variables to impute the corresponding indicator variables.  For example, if a
> variable contains three categories, one will impute two indicator variables,
> corresponding to two categories, and then will compute the third indicator
> variable, corresponding to the reference category, as one minus the sum of the
> two imputed indicator variables.  The imputed indicator variables will contain
> values on a continuous scale.  To convert them to the binary metric, you
> assign 1 to the indicator variable with the largest value and 0 to the other
> indicator variables.  More simulation is needed to evaluate the performance of
> this method in practice.
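
A minimal sketch of the rounding step described above, written in Python
rather than Stata purely for illustration. The function name
round_indicators and the sample values are hypothetical; the only logic
taken from the text is "reference = one minus the sum of the imputed
indicators, then assign 1 to the largest and 0 to the rest". Ties are broken
in favor of the first category, which is an assumption of this sketch.

```python
def round_indicators(imputed):
    """Convert continuous imputed indicators to a 0/1 coding.

    imputed: continuous imputed values for the k-1 non-reference categories.
    Returns k binary indicators, with the reference category last.
    """
    # Reference-category indicator is one minus the sum of the others.
    reference = 1.0 - sum(imputed)
    values = list(imputed) + [reference]
    # Assign 1 to the largest value and 0 to all others
    # (ties go to the first category; an assumption of this sketch).
    winner = max(range(len(values)), key=lambda i: values[i])
    return [1 if i == winner else 0 for i in range(len(values))]


# Three-category variable: two imputed indicators plus the reference.
print(round_indicators([0.62, 0.15]))  # reference is 0.23
print(round_indicators([0.10, 0.25]))  # reference is 0.65
```

Note that imputed values can fall outside [0, 1]; the rule still picks
exactly one category in that case.
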
>
> Allison (2001) also notes that the analysis using imputed values without
> rounding is superior to that which uses rounded imputed values (as described
> above).  Our simulations displayed similar behavior in the case of binary
> predictors.
>
> However, if a binary or categorical _dependent_ variable is being imputed
> using a regression-based method, rounding is unavoidable.
>
>> Would it be possible for StataCorp people to indicate on the list the
>> advantages of their multivariate method compared with Royston's?
>
> -mi impute mvn- implements a method for imputing multivariate continuous data
> based on Schafer (1997), which is an extension of the theoretical work by Li
> (1988).  This method is commonly referred to as NORM.  NORM assumes a joint
> multivariate normal distribution and uses data augmentation (an iterative MCMC
> procedure) to simulate a predictive distribution from which imputed values are
> drawn.
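
To make the idea of data augmentation concrete, here is a toy sketch in
Python. It is not the algorithm used by -mi impute mvn-: it handles a single
normal variable instead of a multivariate normal model, and it assumes the
standard noninformative prior p(mu, sigma^2) proportional to 1/sigma^2. The
function name data_augmentation and the sample data are hypothetical. The
sketch alternates an I-step (draw missing values given the current
parameters) and a P-step (draw parameters from their posterior given the
completed data).

```python
import random
import statistics


def data_augmentation(values, iterations=200, seed=1):
    """Toy data augmentation for one normal variable.

    values: list of floats, with None marking missing entries
            (needs at least two observed values).
    Returns the completed data from the final I-step.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    n = len(values)
    # Starting values for the parameters: observed-data estimates.
    mu = statistics.mean(observed)
    sigma2 = statistics.variance(observed)
    completed = list(values)
    for _ in range(iterations):
        # I-step: draw each missing value from N(mu, sigma2).
        completed = [v if v is not None else rng.gauss(mu, sigma2 ** 0.5)
                     for v in values]
        # P-step: draw (sigma2, mu) from the posterior given completed data.
        xbar = statistics.mean(completed)
        ss = sum((x - xbar) ** 2 for x in completed)
        # sigma2 = ss / chi-square(n - 1); chi-square(k) is Gamma(k/2, scale 2).
        sigma2 = ss / rng.gammavariate((n - 1) / 2, 2.0)
        mu = rng.gauss(xbar, (sigma2 / n) ** 0.5)
    return completed


data = [4.8, None, 5.1, 5.6, None, 4.9, 5.3]
print(data_augmentation(data))
```

Running the function with different seeds (after a burn-in) gives different
completed datasets, which is the role multiple imputations play.
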
>
> Patrick Royston's -ice- command implements imputation via chained equations
> (ICE).  ICE uses Gibbs sampling, another MCMC procedure, to obtain imputed
> values.  ICE, however, does not assume a joint multivariate model.  Instead,
> it uses a set of univariate full conditional specifications.  In general,
> these do not always lead to a proper multivariate distribution.
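
A rough Python sketch of the chained-equations idea for two continuous
variables, each with some missing values. This is not the -ice- command: the
function names fit_line and chained_imputation are hypothetical, each
conditional model is a simple linear regression, and the regression
coefficients are treated as fixed rather than drawn from their posterior,
which a proper implementation would do. The point is only to show the cycle
of univariate conditional models, with no joint model specified anywhere.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of ys on xs: (intercept, slope, residual SD)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5


def chained_imputation(a, b, cycles=10, seed=1):
    """Impute None entries in a and b by cycling through a|b and b|a."""
    rng = random.Random(seed)

    def mean_fill(col):
        obs = [v for v in col if v is not None]
        m = sum(obs) / len(obs)
        return [m if v is None else v for v in col]

    # Start from mean-filled columns, then refine them cycle by cycle.
    cur_a, cur_b = mean_fill(a), mean_fill(b)
    for _ in range(cycles):
        for target, orig, predictor in ((cur_a, a, cur_b), (cur_b, b, cur_a)):
            # Fit the conditional model on rows where the target is observed,
            # using the current (possibly imputed) values of the predictor.
            rows = [i for i, v in enumerate(orig) if v is not None]
            b0, b1, sd = fit_line([predictor[i] for i in rows],
                                  [orig[i] for i in rows])
            # Replace missing targets with prediction plus random noise.
            for i, v in enumerate(orig):
                if v is None:
                    target[i] = b0 + b1 * predictor[i] + rng.gauss(0, sd)
    return cur_a, cur_b


a = [1.0, 2.0, None, 4.0, 5.0, 6.0]
b = [2.1, None, 6.2, 8.1, 9.8, 12.2]
print(chained_imputation(a, b))
```
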
>
> The main advantage of NORM is a theoretical one -- the convergence of the
> method to a proper posterior distribution is theoretically justified.
> Theoretical justification for the chained equation approach in general is not
> as well developed in the literature, mainly because the chained-equation approach
> is not always supported by a proper underlying multivariate model; see, for
> example, van Buuren (2007).
>
> The main advantage of ICE is that it is more flexible than NORM and can more
> directly handle non-continuous data. However, as mentioned above, convergence
> to a proper multivariate distribution can be an issue.
>
> Under the assumption of normality, ICE corresponds to a pure Gibbs sampling
> procedure and is equivalent to NORM.  The two procedures performed comparably
> in our simulation.  More simulation is needed, however, to compare the two
> methods for imputing binary or categorical data.
>
>
> References:
>
> Allison, P. D. 2001. Missing Data. Thousand Oaks, CA: Sage.
>
> Li, K.-H. 1988. Imputation using Markov chains. Journal of Statistical
> Computation and Simulation 30: 57--79.
>
> Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Boca Raton,
> FL: Chapman & Hall/CRC.
>
> van Buuren, S. 2007. Multiple imputation of discrete and continuous data by
> fully conditional specification. Statistical Methods in Medical Research 16:
> 219--242.
>
>
> -- Yulia
> ymarchenko@stata.com
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>



-- 
Fred Wolfe
National Data Bank for Rheumatic Diseases
Wichita, Kansas
NDB Office  +1 316 263 2125 Ext 0
Research Office +1 316 686 9195
fwolfe@arthritis-research.org



