Statalist



st: RE: Advice on multiple imputation in Stata


From   "Verkuilen, Jay" <JVerkuilen@gc.cuny.edu>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Advice on multiple imputation in Stata
Date   Wed, 16 Sep 2009 19:12:14 -0400

Thomas Klausch wrote: 

>>I am working with a set of survey data (n is about 1000) which contains 36 items with missing values (min 0.1% missing, up to 16% max). The items are surveyed on a seven-point scale, so they are, strictly speaking, on a categorical measurement level. I have only a little theoretical knowledge of imputation techniques, and virtually no practical experience. What I would like to do in Stata is something like multiple imputation using EM while treating the items as metric. Or I would like to use some logistic / probit link function in a multinomial model to impute the categorical variables directly.<<

There are a lot of decisions one needs to make in this sort of analysis, and you're right to expect a benefit from practical experience---I liken myself to a radiologist reading a CAT scan when I discuss it with collaborators. Having now done several missing data analyses "for real," I keep finding things I didn't know, didn't expect, or that didn't quite line up with textbook advice. In particular, the "more is merrier" principle often found in tutorial papers can be quite wrong. (Another issue is that Rubin's original advice on the number of imputations you need seems to be far too low.) Dubious cases (e.g., subjects where you're not confident the experimental protocol was followed) can REALLY add noise to your data and kill your power. Furthermore, multicollinearity is often a problem, so loading up your dataset with redundant predictors will lead to convergence issues and/or require you to impose a ridge prior, which will kill your power if you are not careful.
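One quick way to spot redundant predictors before you impute is to screen the candidate variables for collinearity. A minimal sketch (item1-item36 are placeholder names for your own items):

```stata
* Pairwise correlations: very high values flag near-duplicate items
* that may be better left out of the imputation model
corr item1-item36

* _rmcoll drops perfectly collinear variables and leaves the
* surviving list in r(varlist)
_rmcoll item1-item36
display "`r(varlist)'"
```

Nothing here is specific to imputation---it's the same screening you'd do before any regression---but it's cheap insurance against a non-converging imputation model.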

Your situation sounds pretty ideal as you don't have huge amounts of missing data and have a reasonable number of cases to back up the number of variables you have. MI via MCMC will probably work pretty well here. Usually it doesn't make a huge difference whether you treat a categorical variable as continuous or not, so long as the categorical variables aren't markedly non-normal (e.g., very skewed or bimodal). If you are using procedures such as -ologit- that assume categorical data inputs, you can simply round off the imputations to the nearest integer. 
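In Stata 11's new -mi- suite this would look roughly like the sketch below (on older Statas, Patrick Royston's user-written -ice- is the usual alternative). Again, item1-item36 and the choice of 20 imputations are placeholders, not recommendations:

```stata
* Declare the data mi and impute under a multivariate normal model
mi set wide
mi register imputed item1-item36
mi impute mvn item1-item36, add(20)

* Round the continuous imputations back onto the 1-7 scale in each
* completed dataset before using a categorical-data command
foreach v of varlist item1-item36 {
    mi xeq 1/20: quietly replace `v' = min(7, max(1, round(`v')))
}

* Then estimate, e.g. (depvar/covars hypothetical):
* mi estimate: ologit item1 x1 x2
```

Note the min()/max() clamp as well as the round(): normal imputations can fall outside the observed 1-7 range, and you want them pulled back to the endpoints rather than rounded to 0 or 8.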

Alternatively, the Amelia II program (http://gking.harvard.edu/amelia/) uses a parametric bootstrap MI procedure based on the EM-adjusted covariance matrix and mean vector of the variables to be imputed. It will handle categorical data for you, essentially by rounding off. It will accept Stata .dta input and generate Stata .dta output, though as I recall it only reads version 7 files or thereabouts, so you often end up using .csv input unless you want to futz with saving in an old format. (Stata will happily read version 7 .dta files, but as I recall it doesn't write them automatically.)
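The .csv round trip from Stata's side is just an export and an import; a minimal sketch (file names are hypothetical, and -outsheet-/-insheet- are the commands in current Stata---Amelia itself does the imputing in between):

```stata
* Write the items out as comma-separated text for Amelia II
outsheet item1-item36 using mydata.csv, comma replace

* ... run Amelia II on mydata.csv; it writes one file per
* imputation, e.g. outdata1.csv, outdata2.csv, ...

* Read a completed dataset back in
insheet using outdata1.csv, comma clear
```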

Jay

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


