The Stata listserver

RE: st: Hotdeck imputation

From   "Daniel Waxman" <[email protected]>
To   <[email protected]>
Subject   RE: st: Hotdeck imputation
Date   Mon, 13 Jun 2005 07:30:02 -0400

Thank you very much for taking the time to reply.
It is crystal clear, and extremely helpful.

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of maartenbuis
Sent: Monday, June 13, 2005 5:02 AM
To: [email protected]
Subject: Re: st: Hotdeck imputation

--- "Daniel Waxman" <dan@a...> wrote:
> I need to do a relatively simple imputation, but am having trouble 
> following the examples given.  
> Here is the situation:
> Dataset ~ 10,000 obs (non-weighted, 1 obs/subject)
> Variable to be imputed:
> EKG_abnormal     --binary(yes/no),  missing at random < 5% of 
> observations.
> Potential predictors with which to impute:  
> At least five, some binary (e.g. chestpain yes/no, first_cat (1-5), 
> etc.)
> some which are continuous but can be made categorical (e.g. age ==> 
> age_cat)
> Primary outcome being studied:  Death yes/no
> The questions:
> (1) Should I use the outcome variable (death) as one of the imputation
> variables?  Should I use many imputation variables since I can
> (large dataset)?
> (2) Most important:  Can somebody give an example for the correct
> way to issue the commands?
> If I do the following:
> . hotdeck ekg_abnormal using imp, by(agecat first_cat) store
> keep(merge_variable) impute(5)
> Then I end up with 5 files, imp1 imp2 imp3 imp4 imp5
> Eventually I want to end up with imputed values for ekg_abnormal
> that I can use in the main logistic regression equation of interest. 
> Not sure where the options infile(), command(logit) fit into things.

The two questions are related: -hotdeck- produces multiple files 
because it does the Multiple Imputation variant of hotdeck 
imputation, and because it does multiple imputation you should also 
include your dependent variable. You should include the dependent 
variable because, if you don't, the missing values are imputed as if 
there were no relation between ekg_abnormal and death, so the 
relationship between these two variables estimated from the imputed 
datasets will be underestimated. 

Adding more variables to the imputation makes the MAR assumption more 
plausible, but increases the probability that some of the cells are 
very sparse. Empty or nearly empty cells should be avoided in hotdeck 
imputation. So you should add variables that are strongly related 
to the imputed variable, and you should add as many as possible 
without creating sparse cells. 
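
One way to check for sparse cells before imputing is to count, for each 
combination of the by-variables, how many observed values (donors) are 
available to fill the missing ones. The sketch below is a plain Python 
illustration of that idea, not part of -hotdeck-; the function name, the 
list-of-dicts data layout, and the threshold of 5 donors are all 
assumptions made for the example.

```python
from collections import Counter

def sparse_cells(rows, target, by, min_donors=5):
    """Return {cell: (n_donors, n_missing)} for every cell that holds at
    least one missing value of `target` (coded as None) but fewer than
    `min_donors` observed values. The threshold is an arbitrary choice."""
    donors, missing = Counter(), Counter()
    for r in rows:
        cell = tuple(r[k] for k in by)  # one cell per combination of by-variables
        if r[target] is None:
            missing[cell] += 1
        else:
            donors[cell] += 1
    # A Counter returns 0 for cells that never had a donor.
    return {c: (donors[c], n) for c, n in missing.items() if donors[c] < min_donors}
```

If this returns a non-empty result, coarsening a categorical variable or 
dropping a weak predictor from the by-list would be the usual remedies.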

The idea behind Multiple Imputation is as follows: If you impute just 
once, you assume that you are as sure about the imputed values as you 
are about the observed values. So, if you impute once, you 
underestimate the standard error, i.e. you think you are more sure 
about the parameter than you really are. However, the observed cases 
in each cell also give information about the distribution of likely 
values of the missing observations (under the MAR assumption). For 
each missing value you can draw at random a number of values, e.g. 5, 
from this distribution, and thus create 5 completed datasets. These 
are the completed datasets you got from the -hotdeck- command. You 
can now estimate the model of interest for each completed dataset. 
The variation in estimates between completed datasets is a measure of 
the added uncertainty due to using imputed values. The procedure used 
by -hotdeck- is described in Rubin (1987, pp. 122-124) or Allison 
(2002, pp. 57-58). 
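
To make the drawing step concrete, here is a minimal Python sketch of 
hotdeck multiple imputation within cells. It is an illustration only: 
the function names and the list-of-dicts layout are invented for the 
example, and the two-step draw (first resample the donors in a cell, 
then draw from that resample), often called the approximate Bayesian 
bootstrap, is assumed here rather than taken from the -hotdeck- source.

```python
import random

def hotdeck_once(rows, target, by, rng):
    """Return a completed copy of `rows`: each missing `target` (None) is
    replaced by a value drawn from donors in the same cell."""
    # Collect the observed values of the target in each cell.
    donors = {}
    for r in rows:
        if r[target] is not None:
            donors.setdefault(tuple(r[k] for k in by), []).append(r[target])
    # Step 1: resample each cell's donors with replacement.
    pools = {c: [rng.choice(v) for _ in v] for c, v in donors.items()}
    completed = []
    for r in rows:
        new = dict(r)  # copy, so the original data stay untouched
        if new[target] is None:
            cell = tuple(new[k] for k in by)
            if cell not in pools:
                raise ValueError(f"no donors in cell {cell}")
            # Step 2: draw the imputed value from the resampled pool.
            new[target] = rng.choice(pools[cell])
        completed.append(new)
    return completed

def multiple_hotdeck(rows, target, by, m=5, seed=12345):
    """Create m completed datasets from independent draws."""
    rng = random.Random(seed)
    return [hotdeck_once(rows, target, by, rng) for _ in range(m)]
```

Calling multiple_hotdeck(rows, "ekg_abnormal", ["agecat", "death"]) on 
such a list would return five completed copies, playing the role of 
the five files produced by impute(5).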

The command you could use is:
hotdeck ekg_abnormal, by(chestpain firstcat agecat death) ///
    command(logit death chestpain age firstcat ekg_abnormal) ///
    parms(chestpain age firstcat ekg_abnormal _cons) impute(5)

This will generate the datasets, estimate the model of interest (the 
model specified in the command-option), and combine the results 
(those put in the parms-option) for you. 
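
For reference, the standard way to combine estimates across completed 
datasets is Rubin's rules: average the point estimates, and add the 
between-imputation variance (inflated by 1 + 1/m) to the average 
within-imputation variance. The Python sketch below shows the 
arithmetic for a single coefficient; that -hotdeck- combines results 
in exactly this way is an assumption, and the function name is 
invented for the example.

```python
import math

def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m completed datasets.
    Returns (pooled estimate, pooled standard error)."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates, each with a standard error")
    q_bar = sum(estimates) / m                               # pooled point estimate
    within = sum(se ** 2 for se in std_errors) / m           # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # variance across imputations
    total = within + (1 + 1 / m) * between                   # total variance
    return q_bar, math.sqrt(total)
```

When the estimates barely differ across datasets the between term is 
close to zero and the pooled standard error is close to the usual one; 
the more the imputations disagree, the larger the pooled standard error.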

Hope this helps,

Donald Rubin (1987) "Multiple Imputation for Nonresponse in Surveys", 
New York: Wiley.
Paul Allison (2002) "Missing Data", Thousand Oaks: Sage.
