
st: RE: Imputing values for categorical data

From	"Dupont, William" <[email protected]>
To	<[email protected]>
Subject	st: RE: Imputing values for categorical data
Date	Thu, 15 Apr 2004 14:47:15 -0500

Jennifer

In my opinion, imputation makes the most sense when we wish to adjust
for confounding variables.  Suppose that I am primarily interested in
the relationship between y and x, and I have complete data on these two
variables from my data set.  I feel, however, that I should adjust my
analysis for a number of other confounding covariates and I know that
missing values are scattered throughout these covariates.  If I just
regress y against x and these other covariates I get a complete case
analysis: any record that is missing any value of these covariates is
dropped from the analysis.  This can lead to a substantial loss of power
and has the potential to induce bias if having complete data is related
to the response of interest.  Suppose that one of my confounding
variables is gender.  If I have a number of records where y and x are
known but gender is not, it does not seem sensible to throw out this
information just because I would like to adjust my estimates for gender.
If, however, I impute gender I can avoid losing these data.  As long as
gender is only in the model as a confounder, I don't see that it does
much harm to have an imputed value of, say, .2 for some patient, which
means that, based on her other covariates, she is 4 times more likely
to be of one gender than the other.
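
To make the contrast concrete, here is a minimal sketch in Python (not
Stata) of a complete case analysis versus a single conditional
imputation.  The toy records, the function names, and the choice to
predict gender from x alone are all assumptions made for illustration;
this is not a description of what Stata's impute command does
internally.

```python
from statistics import mean

# Hypothetical toy records: (y, x, gender); gender is 0/1, or None when missing.
records = [
    (2.1, 1.0, 0),
    (2.9, 2.0, 1),
    (3.2, 3.0, None),
    (4.8, 4.0, 1),
    (5.1, 5.0, None),
    (5.9, 6.0, 0),
]

def complete_cases(rows):
    """Listwise deletion: keep only rows with no missing value."""
    return [r for r in rows if all(v is not None for v in r)]

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def impute_gender(rows):
    """Single conditional imputation: fill missing gender with its fitted value given x."""
    observed = complete_cases(rows)
    a, b = fit_line([r[1] for r in observed], [r[2] for r in observed])
    return [(y, x, g if g is not None else a + b * x) for y, x, g in rows]

kept = complete_cases(records)
filled = impute_gender(records)
print(len(kept), "of", len(records), "records survive listwise deletion")
print(len(filled), "records are available after imputation")
```

The imputed entries are fractional because they are fitted values for a
0/1 variable, which is why a value such as .35621 can appear for gender.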

A tricky problem with imputation is that we often lack assurance that
the missing values are missing at random.  However, even in this
situation, it is unclear that the complete case analysis is superior to
an imputed analysis for the situation described above.  Imputation
becomes much more problematic when some variables of primary interest
have missing values.

The imputation gurus do not like the single conditional imputation
provided by Stata (see for example Little and Rubin 2002).  This is
because this technique underestimates the standard error of the
regression coefficient for covariates with imputed values and
overestimates the degrees of freedom.  Multiple imputation methods get
around this problem and are fine as long as you are confident that the
missing values are missing at random.  If you are only using imputation
for confounding variables I'm not convinced that it makes much
difference how you do the imputation.  However, multiple imputation is
always theoretically preferable and can avoid hassles in the event that
you come up against a referee who objects to all use of single
conditional imputation.
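
For readers who want to see why multiple imputation gives larger
standard errors, here is a minimal Python sketch of the standard
combining rules (Rubin's rules) for m imputed data sets.  The function
name and the example numbers are hypothetical; in practice the
per-imputation estimates and variances would come from fitting the same
model to each completed data set.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine one coefficient across m imputed data sets.

    estimates: the coefficient estimated in each completed data set
    variances: its squared standard error in each completed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical results from m = 3 imputations of the same regression.
est, total_var = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04])
print(est, total_var)
```

The between-imputation term is the part a single imputation cannot see,
which is why its standard errors come out too small.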

Bill Dupont

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Jennifer
Wolfe Borum
Sent: Thursday, April 08, 2004 5:50 PM
To: [email protected]
Subject: st: Imputing values for categorical data


Hello,

I am working with a data set composed of responses to survey questions
which contains some categorical variables such as gender and ethnicity.
The data has missing values and I have decided that it would be best to
keep all observations due to a pattern in the missing values. I have
decided to use the impute command in Stata to handle this as I've had
some difficulty and am not familiar enough with the hotdeck and Amelia
imputations. I've found that impute works fine for the continuous
variables; however, for the categorical variables I am obtaining values
that I am unsure how to interpret. For example, I will get an
imputed value of .35621 for gender which is coded 1 or 0. Would anyone
be able to help with the interpretation of the values I am obtaining for
the categorical data?

Also, I would be interested in knowing which approach other Stata users
prefer for imputing values as this is the first time I have encountered
missing values and I am just beginning to research the various methods
of imputation. 

Thanks in advance,
Jennifer

Graduate Student
Florida International University

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

