
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: Imputing values for categorical data
Date: Fri, 9 Apr 2004 10:43:36 +0100

Well, yes and no. Renzo's advice is clear and sensible, and I'd add
an even more general warning: Don't assume that there is a single
wonderful magic way of imputing missing data.

Nevertheless it seems to me that the on-line help and manual entry
for -impute- are pretty clear that -impute- is based on regression.
It doesn't seem that extraordinary to me to assume that users know
what regression is and are capable of thinking whether it is a
sensible technique for their problem.

One sure way of bloating the manuals further is to add warnings
almost everywhere along the lines of "Think carefully whether this
command is appropriate for your problem and your data."

Nick
n.j.cox@durham.ac.uk

Renzo Comolli

> I have one piece of advice: be very careful when using -impute-.
> It is not suitable for imputing categorical variables, and I am
> surprised the manual does not mention that.
> When I actually "ripped the ado file open" and saw what it does,
> I gave up on imputing categorical variables, but I had never done
> imputations before, so I have very little knowledge of the field.
>
> At its core, -impute- does a simple OLS projection.
> Let me explain with a simplified case first and then with a
> more complicated case.
> Simplifying assumption: only one variable (denoted by y) needs to
> be imputed; all the other variables (denoted by matrix X) have
> no missings.
> Without loss of generality, assume that you have ordered the
> variable y so that all the cases for which you have observations
> appear at the top (denote this part of the vector by y') and all
> the missings at the bottom (denote this part of the vector by y").
> Also denote by X' and X" the corresponding values of X (remember
> that X has no missings; X" just contains the X values
> corresponding to the observations in y").
> Then -impute- simply runs OLS on y' = X'beta + epsilon, where
> beta is the OLS vector of coefficients.
> It saves beta and imputes y" as X"beta.
> So of course this is completely unsuitable for categorical
> variables.
> Even with continuous variables you have to be careful not to
> predict "out of range". Let's assume that you are predicting
> "number of weeks of work": it might well happen that -impute-
> predicts that the interviewee worked -1 weeks last year.
>
> The case is not that simple when the matrix X contains missing
> values itself. If so, -impute- looks for the best subset of
> regressors. In practice, -impute- repeats the procedure explained
> above several times, trying to keep as many regressors as possible
> (exactly how, I did not understand either from the ado file or
> from the manual, but I did not spend much time on it, because I
> did not care that much).
>
> That said, I did not know of these other methods you mentioned
> (hotdeck, Amelia) and I would be glad to read what others have to
> say about them.
>

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
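
The simple case Renzo describes (one variable y with missings, complete X) can be sketched in a few lines. This is only a rough illustration in Python, not the -impute- ado code: the data are made up, there is a single regressor x plus a constant, and all variable names are hypothetical. It fits OLS on the observed part (y', X') and fills in y" with the fitted values X"beta.

```python
# Illustrative sketch only: this is NOT the -impute- ado code.
# Made-up data: x is complete, y is a 0/1 category with two missings (None).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 9.0, -2.0]
y = [0, 0, 0, 1, 1, 1, None, None]

# Split into the observed part (y', X') and the missing part (X").
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
x_missing = [xi for xi, yi in zip(x, y) if yi is None]

# OLS of y' on a constant and x', using the closed form for one regressor.
n = len(obs)
mean_x = sum(xi for xi, _ in obs) / n
mean_y = sum(yi for _, yi in obs) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs)
sxx = sum((xi - mean_x) ** 2 for xi, _ in obs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Impute y" as the fitted values X"beta.
imputed = [intercept + slope * xi for xi in x_missing]
print(imputed)
```

With these made-up numbers the two imputed values come out at about 1.91 and -0.91. Neither is a valid value of a 0/1 variable, and the second is negative, which is the same mechanism that can produce "-1 weeks of work" for a continuous variable.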

