
From: "Daniel Waxman" <dan@amplecat.com>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: Hotdeck imputation
Date: Mon, 13 Jun 2005 07:30:02 -0400

Maarten,

Thank you very much for taking the time to reply. It is crystal clear, and extremely helpful.

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of maartenbuis
Sent: Monday, June 13, 2005 5:02 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Hotdeck imputation

--- "Daniel Waxman" <dan@a...> wrote:
> I need to do a relatively simple imputation, but am having trouble
> following the examples given.
> Here is the situation:
>
> Dataset ~ 10,000 obs (non-weighted, 1 obs/subject)
>
> Variable to be imputed:
> EKG_abnormal --binary(yes/no), missing at random < 5% of
> observations.
>
> Potential predictors with which to impute:
> At least five, some binary (e.g. chestpain yes/no, first_cat (1-5),
> etc.)
> some which are continuous but can be made categorical (e.g. age ==>
> age_cat)
>
> Primary outcome being studied: Death yes/no
>
> The questions:
> (1) Should I use the outcome variable (death) as one of the imputation
> variables? Should I use many imputation variables since I can
> (large dataset)?
>
> (2) Most important: Can somebody give an example for the correct
> way to issue the commands?
>
> If I do the following:
>
> . hotdeck ekg_abnormal using imp, by(agecat first_cat) store
> keep(merge_variable) impute(5)
>
> Then I end up with 5 files, imp1 imp2 imp3 imp4 imp5
> Eventually I want to end up with imputed values for ekg_abnormal
> that I can use in the main logistic regression equation of interest.
> Not sure where the options infile(), command(logit) fit into things.

The two questions are related: -hotdeck- produces multiple files because it does the Multiple Imputation variant of hotdeck imputation, and because it does multiple imputation you should also include your dependent variable. If you leave the dependent variable out, the missing values are imputed under the assumption that there is no relation between ekg_abnormal and death, so the relationship between these two variables estimated from the imputed datasets will be underestimated.

Adding more variables to the imputation makes the MAR assumption more plausible, but increases the probability that some of the cells are very sparse. Empty or nearly empty cells should be avoided in hotdeck imputation. So you should add variables that are strongly related to the imputed variable, and you should add as many as possible without creating sparse cells.

The idea behind Multiple Imputation is as follows: If you impute just once, you assume that you are as sure about the imputed values as you are about the observed values. So, if you impute once you underestimate the standard error, i.e. you think you are more sure about the parameter than you really are. However, the observed cases in each cell also give information about the distribution of likely values of the missing observations (under the MAR assumption). For each missing value you can draw at random a number of values, e.g. 5, from this distribution, and thus create 5 completed datasets. These are the completed datasets you got from the -hotdeck- command. You can now estimate the model of interest on each completed dataset. The variation in the estimates between completed datasets is a measure of the added uncertainty due to using imputed values. The procedure used by -hotdeck- is described in Rubin (1987, pp. 122-124) and Allison (2002, pp. 57-58).

The command you could use is:

    hotdeck ekg_abnormal, by(chestpain firstcat agecat death) command(logit death chestpain age firstcat ekg_abnormal) parms(chestpain age firstcat ekg_abnormal _cons) impute(5)

This will generate the datasets, estimate the model of interest (the model specified in the command() option), and combine the results (those listed in the parms() option) for you.

Hope this helps,
Maarten

Donald Rubin (1987) "Multiple Imputation for Nonresponse in Surveys", New York: Wiley.
Paul Allison (2002) "Missing Data", Thousand Oaks: Sage.
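The point about variation between completed datasets can be made concrete with the standard combining rules from Rubin (1987). What follows is only a minimal sketch in Python, not the code that -hotdeck- runs internally; the function name pool_rubin and the numbers in the usage note are made up for illustration. For one coefficient, the pooled estimate is the mean of the per-dataset estimates, and the total variance is the average within-dataset variance plus (1 + 1/m) times the between-dataset variance.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m completed datasets.

    estimates  -- the coefficient estimated on each completed dataset
    std_errors -- its standard error on each completed dataset
    Returns (pooled estimate, pooled standard error).
    pool_rubin is a hypothetical helper, not part of -hotdeck-.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")

    pooled = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # variance across imputations (divides by m - 1)
    total = within + (1 + 1 / m) * between          # total variance of the pooled estimate
    return pooled, math.sqrt(total)
```

As a usage example with invented values, pool_rubin([0.50, 0.60, 0.40, 0.55, 0.45], [0.20] * 5) gives a pooled standard error slightly above 0.20. The excess over 0.20 is the added uncertainty from imputation, which a single imputation would ignore.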
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

**References**: **Re: st: Hotdeck imputation** *From:* "maartenbuis" <maartenbuis@yahoo.co.uk>


© Copyright 1996–2014 StataCorp LP