Re: st: Efficient handling of missing data

From	Michael Ingre <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: Efficient handling of missing data
Date	Mon, 26 Jan 2004 14:39:27 +0100

Stas Kolenikov:

> If you want to implement the EM algorithm, you'd also need to figure out
> the second moments and the relevant sufficient statistics. That is likely
> to be a lot of matrix computations:

I'm not very good at matrix computations, and I appreciate the work of
statisticians like yourself who make these state-of-the-art procedures
available to people like me. I would not dare to try to implement it myself
(for now anyway). But I know that there are people out there who do it
just for the fun of it.

> About three years ago, I've tried to have fun writing a
> program to fully account for the missing regressors in the logistic
> regression using the standard Stata's -ml-, and it basically was
> impractical beyond 50 observations x 3 regressors -- it took about half an
> hour to estimate the model for this HUGE :-\ data set.

I suspect that just by upgrading your computer (from that time) to today's
standard, it would run in 5-10 minutes. And tomorrow it would run
in a minute ... Anyway, I see your point. These are computationally
intensive operations.

However, with the Multiple Imputation (MI) strategy this is not a key issue.
In many cases you impute once and then run all your analyses on those
datasets. There are limitations, of course, but the main point is that it is
quite OK even if it takes several hours to impute the data. The standard
analyses you apply to your data after imputation will run at normal speed
(times m, the number of imputed datasets).
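
To illustrate why the post-imputation step is cheap, here is a minimal
sketch in Python of how the m per-dataset results are usually combined
(Rubin's rules). The function name pool_rubin and the example numbers are
made up for illustration; this is not the code of any particular Stata MI
procedure.

```python
import statistics

def pool_rubin(estimates, variances):
    """Combine one coefficient estimated on m imputed datasets.

    estimates: the m point estimates (one per imputed dataset)
    variances: the m squared standard errors (one per imputed dataset)
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates and one variance for each")
    pooled = statistics.mean(estimates)        # average of the m estimates
    within = statistics.mean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return pooled, total

if __name__ == "__main__":
    # Hypothetical results from m = 3 imputed datasets
    print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The analysis itself is simply repeated m times with the usual commands, so
the extra cost after imputation is roughly m ordinary runs plus this small
pooling step.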
 
> Those are my two cents. Even though I am writing a dissertation on the
> missing data and the EM algorithm, I think StataCorp is wise enough to be
> careful with the missing data procedures -- I am used to "if something
> obvious is not implemented in Stata, then it is in fact statistically
> wrong, or has so many limitations the user should think twice, or better a
> dozen times, while implementing it himself" ideology, and things like MI,
> even though a long desired addition, still look like an ad-hoc patch
> unless you come from a scientific subgroup where it is beloved and fully
> acceptable.

I come from a scientific subgroup (social sciences, stress and sleep
research) where missing data is common. It is a major concern especially in
longitudinal studies. I could not say however, that MI is beloved and fully
acceptable in my field (that's an understatement). Not yet.

And I agree with you in principle that you should be careful with applying
statistical procedures that are not considered standard in major statistical
packages like Stata.

However, I think you may agree with me when I suggest that Stata's
default handling of missing data in most procedures (listwise deletion) is
less than optimal and even statistically wrong in many cases. With MCAR data
it means less power and a higher type 2 error rate. But with MAR data it
can also introduce bias in the estimates.
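
The MCAR/MAR difference can be seen in a small simulation. The following
Python sketch is only an illustration under assumptions of my own choosing:
a made-up model y = x + noise (true mean of y is 0), about 40% of y values
missing in both scenarios, and a hypothetical function name
complete_case_mean. Under MCAR every y has the same chance of being
missing; under MAR the chance depends on the fully observed x.

```python
import random
import statistics

def complete_case_mean(n=20000, mechanism="MCAR", seed=1):
    """Mean of y after listwise deletion; the true mean of y is 0."""
    rng = random.Random(seed)
    observed_y = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)        # always observed
        y = x + rng.gauss(0.0, 1.0)    # may be missing
        if mechanism == "MCAR":
            p_missing = 0.4            # same for every case
        else:
            # MAR: missingness depends only on the observed x
            # (0.8 for half the cases gives about 40% missing overall)
            p_missing = 0.8 if x > 0 else 0.0
        if rng.random() >= p_missing:
            observed_y.append(y)       # case survives listwise deletion
    return statistics.mean(observed_y)

if __name__ == "__main__":
    print("MCAR:", complete_case_mean(mechanism="MCAR"))
    print("MAR: ", complete_case_mean(mechanism="MAR"))
```

With these settings the MCAR mean should land close to zero (only
precision is lost), while the MAR mean should come out clearly below zero,
because cases with high x, and therefore high y, are dropped more often.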

It is not uncommon that 20-40% of the cases are dropped due to missing data
in multivariate analyses. How many times should you think before you draw
conclusions from estimates on data where 40% of the cases have been excluded?

Several ad-hoc procedures with questionable statistical validity have been
adopted in my field to reduce the number of cases lost in the presence of
missing data. In my view (and that of others) these are often more
problematic than MI. For those of you who are interested in further reading,
an excellent (not so technical) review of missing data from a methodological
point of view, and of statistical procedures to handle them, is presented by
Schafer and Graham (2002).

Finally, I think it is safe to say that statistical procedures for handling
missing data are one of the most important areas for further development, at
least in my field.

Michael Ingre

----------------- 
PhD-student 
Department of Psychology
Stockholm University &
National Institute for
Psychosocial Medicine


References

Schafer, J. L. & Graham, J. W. (2002). Missing data: our view of the state
of the art. Psychological Methods, 7(2), 147-177.

PS. Thanks to Patrick Royston for providing the first available MI
procedures for Stata. Soon, I hope to see more of them.

 




