# Re: st: Imputing values for categorical data

 From "Renzo Comolli" To Subject Re: st: Imputing values for categorical data Date Thu, 8 Apr 2004 23:29:44 -0400

```Hi Jennifer,

I have one piece of advice: be very careful when using -impute-.
It is not suitable for imputing categorical variables, and I am surprised the
manual does not mention that.
When I actually "ripped the ado file open" and saw what it does, I gave up on
imputing categorical variables, but I had never done imputations before, so I
have very little knowledge of the field.

At its core, -impute- does a simple OLS projection.
Let me explain with a simplified case first and then with a more complicated
case.
Simplifying assumption: only one variable (denoted by y) needs to be
imputed; all the other variables (denoted by matrix X) have no missings.
Without loss of generality assume that you have ordered the variable y so
that all the cases for which you have observations appear at the top (denote
this part of the vector y'), and all the missings at the bottom, denote this
part of the vector y by y". Also denote by X' and X" the corresponding
values of X (remember that X has no missings, X" just contains the X values
corresponding to the observation y")
Then -impute- simply does OLS of y'=X'beta+epsilon, where beta is the
vector of coefficients. It saves the OLS estimate of beta and imputes y" by
computing X"beta.
So of course this is completely unsuitable for categorical variables.
Even with continuous variables you have to be careful not to predict "out of
range". Say you are predicting "number of weeks of work": it might well
happen that -impute- predicts that the interviewee worked -1 weeks last year.
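
* A minimal sketch of the single-variable case above, not the actual -impute-
* code. The variable names y, x1, x2, yhat and y_imp are hypothetical.
* -regress- uses only the observations where y is observed (y' and X').
regress y x1 x2
* Linear prediction X*beta for every observation, including those missing y.
predict yhat, xb
* Keep the observed values and fill the missing ones (y") with the prediction.
generate y_imp = y
replace y_imp = yhat if missing(y)
* If y is coded 0/1, nothing forces yhat to be 0 or 1, so fractional values
* can appear; for a bounded y, yhat can also fall outside the valid range.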

The case is not that simple when the matrix X itself contains missing
values. If so, -impute- looks for the best subset of regressors. In practice
-impute- repeats the procedure explained above several times, trying to
keep as many regressors as possible (exactly how, I did not understand either
from the ado file or from the manual, but I did not spend much time on it
because I did not care that much).

That said, I did not know of these other methods you mentioned (hotdeck,

Best,
Renzo Comolli

--------------------------------------------------------------------------------
From   Jennifer Wolfe Borum <jjfrog@bellsouth.net>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Imputing values for categorical data
Date   Thu, 8 Apr 2004 18:50:21 -0400

--------------------------------------------------------------------------------

Hello,

I am working with a data set composed of responses to survey questions which
contains some categorical variables such as gender and ethnicity. The data
has missing values and I have decided that it would be best to keep all
observations due to a pattern in the missing values. I have decided to use
the impute command in Stata to handle this as I've had some difficulty and
am not familiar enough with the hotdeck and Amelia imputations. I've found
that impute works fine for the continuous variables; however, for the
categorical variables I am obtaining values that I am unsure how to
interpret. For example, I will get an imputed value of .35621 for gender,
which is coded 1 or 0. Would anyone be able to help with the interpretation
of the values I am obtaining for the categorical data?

Also, I would be interested in knowing which approach other Stata users
prefer for imputing values as this is the first time I have encountered
missing values and I am just beginning to research the various methods of
imputation.

Jennifer