
RE: st: Imputing values for categorical data


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Imputing values for categorical data
Date   Fri, 9 Apr 2004 10:43:36 +0100

Well, yes and no. Renzo's advice is clear and sensible, 
and I'd add an even more general warning: Don't assume
that there is a single wonderful magic way of imputing
missing data. 

Nevertheless it seems to me that the on-line 
help and manual entry for -impute- are pretty clear
that -impute- is based on regression. It doesn't seem
that extraordinary to me to assume that users know what
regression is and are capable of thinking whether it is 
a sensible technique for their problem. One sure way of 
bloating the manuals further is to add warnings almost
everywhere along the lines of "Think carefully whether 
this command is appropriate for your problem and your data."

Nick 
n.j.cox@durham.ac.uk 

Renzo Comolli

> I have one piece of advice: be very careful when using -impute-.
> It is not suitable for imputing categorical variables, and I am
> surprised the manual does not mention that.
> When I actually "ripped the ado file open" and saw what it does,
> I gave up on imputing categorical variables, but I had never done
> imputations before, so I have very little knowledge of the field.
> 
> At its core, -impute- does a simple OLS projection.
> Let me explain with a simplified case first and then with a more
> complicated case.
> Simplifying assumption: only one variable (denoted by y) needs to be
> imputed; all the other variables (denoted by the matrix X) have no
> missings.
> Without loss of generality, assume that you have ordered the variable y
> so that all the cases for which you have observations appear at the top
> (denote this part of the vector y by y') and all the missings at the
> bottom (denote this part of the vector y by y"). Also denote by X' and
> X" the corresponding values of X (remember that X has no missings; X"
> just contains the X values corresponding to the observations in y").
> Then -impute- simply does OLS of y'=X'beta+epsilon, where beta is the
> vector of coefficients. It saves the estimate of beta and imputes y"
> by computing X"beta.
> So of course this is completely unsuitable for categorical variables.
> Even with continuous variables you have to be careful not to predict
> "out of range". Suppose you are predicting "number of weeks of work":
> it might well happen that -impute- predicts that the interviewee worked
> -1 weeks last year.
> 
> The case is not that simple when the matrix X itself contains missing
> values. If so, -impute- looks for the best subset of regressors. In
> practice -impute- repeats the procedure explained above several times,
> trying to keep as many regressors as possible (exactly how, I did not
> understand either from the ado file or from the manual, but I did not
> spend much time on it, because I did not care that much).
> 
> That said, I did not know of the other methods you mentioned (hotdeck,
> Amelia), and I would be glad to read what others have to say about them.
> 
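
The simplified case in the quoted message (fit OLS on the complete cases,
then fill the missing y with the fitted values) can be sketched roughly as
follows. This is an illustration in plain Python with a single regressor
plus an intercept, not the code of -impute- itself; the function names and
the data are made up for the example.

```python
# Sketch of regression-based imputation with one regressor.
# Assumption: this mirrors only the simplified case (x has no missings);
# it is NOT the -impute- ado file. All data below are invented.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_by_regression(x, y):
    """Fill None entries of y with a + b*x, fitted on the observed pairs."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = ols_fit([p[0] for p in observed], [p[1] for p in observed])
    return [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]


# Continuous outcome: weeks worked last year (last value missing).
weeks_filled = impute_by_regression([2, 3, 4, 5, 1], [1, 3, 5, 7, None])

# Categorical outcome coded 1, 2, 3 (last value missing).
category_filled = impute_by_regression([1, 2, 3, 4, 3], [1, 1, 3, 3, None])

print(weeks_filled[-1])     # a negative number of weeks
print(category_filled[-1])  # a value between two category codes
```

With these invented numbers the fitted line for weeks has slope 2 and
intercept -3, so the missing case is filled with -1 weeks, and the
categorical variable is filled with a value of about 2.4, which is not a
valid category. Both are the kinds of result the quoted message warns about.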
