
Re: st: Efficient handling of missing data

From	Stas Kolenikov <[email protected]>
To	[email protected]
Subject	Re: st: Efficient handling of missing data
Date	Fri, 23 Jan 2004 10:55:42 -0500 (EST)

> To the best of my knowledge (!), the most valid method to handle missing
> data (MAR & MCAR) is to use Full Information Maximum Likelihood (FIML)
> or Multiple Imputation (MI) techniques. I know that there is a set of
> tools for analyzing MI-datasets available (SJ3-3 st0042) but there seem
> to be no tools available for generating them.

Basically, the tool you need is the conditional distribution of the
missing data given the observed data. For the multivariate normal
distribution, this is a standard formula available in any textbook on
multivariate analysis, something like this:

Mean[y|x] = Mean[y]+Sigma_yx * inv(Sigma_xx) * (x-Mean[x])

for the mean, etc. I suspect that is enough to generate the data for the
MI procedure -- under the standard assumption that everything is
multivariate normal. Bear with it: that's what all the other MI programs
assume, too.
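To make the formula concrete, here is a minimal sketch in plain Python
(standard library only). It computes the conditional mean above and the
matching textbook conditional covariance, Sigma_yy - Sigma_yx *
inv(Sigma_xx) * Sigma_xy. The function names (solve, conditional_normal)
and the list-of-lists matrix layout are my own assumptions for
illustration, not part of any existing MI package.

```python
# Sketch: conditional mean and covariance of the missing block y given the
# observed block x, for a jointly normal vector. Names are illustrative.

def solve(a, b):
    """Solve a * z = b by Gauss-Jordan elimination with partial pivoting.
    a is n x n and b is n x k, both as lists of rows; returns z (n x k)."""
    n = len(a)
    m = [list(a[i]) + list(b[i]) for i in range(n)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def conditional_normal(mean, sigma, obs_idx, mis_idx, x):
    """Return (cond_mean, cond_cov) of the components listed in mis_idx,
    given that the components listed in obs_idx take the values in x."""
    nx, ny = len(obs_idx), len(mis_idx)
    s_xx = [[sigma[i][j] for j in obs_idx] for i in obs_idx]
    s_xy = [[sigma[i][j] for j in mis_idx] for i in obs_idx]
    s_yy = [[sigma[i][j] for j in mis_idx] for i in mis_idx]
    # One solve gives inv(Sigma_xx) * [ (x - Mean[x]) | Sigma_xy ].
    rhs = [[x[k] - mean[obs_idx[k]]] + s_xy[k] for k in range(nx)]
    w = solve(s_xx, rhs)
    # Mean[y|x] = Mean[y] + Sigma_yx * inv(Sigma_xx) * (x - Mean[x])
    cond_mean = [mean[mis_idx[a]] + sum(s_xy[k][a] * w[k][0] for k in range(nx))
                 for a in range(ny)]
    # Cov[y|x] = Sigma_yy - Sigma_yx * inv(Sigma_xx) * Sigma_xy
    cond_cov = [[s_yy[a][b] - sum(s_xy[k][a] * w[k][1 + b] for k in range(nx))
                 for b in range(ny)] for a in range(ny)]
    return cond_mean, cond_cov

# Two standard normal variables with correlation 0.5; the first is observed.
m, v = conditional_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], [0], [1], [2.0])
```

Given values for the mean vector and Sigma, an imputation for one
observation would then be a draw from the normal distribution with this
conditional mean and covariance.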

If you want to implement the EM algorithm, you'd also need to work out
the second moments and the relevant sufficient statistics. That is likely
to mean a lot of matrix computation: basically, you need to build and
invert a matrix of a special form for each observation. Stata is not very
fast with matrices (here, you need to drop the rows and columns for which
no data are available in a given observation to get your Sigma_xx and
Sigma_xy), and it is really slow when you loop observation by
observation. About three years ago, I tried, for fun, to write a program
that fully accounted for missing regressors in logistic regression using
Stata's standard -ml-, and it was basically impractical beyond 50
observations x 3 regressors -- it took about half an hour to estimate the
model for this HUGE :-\ data set. I am pretty sure that the performance of
my straightforward program could be improved by a factor of ten with some
smart matrix operations, or by writing some of the code in C and compiling
it (which was not possible with the Stata 6 or 7 I had then), but that
would still be about a hundred times slower than the full-data -logit-
command.
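As an illustration of the row/column elimination, here is a rough sketch
in plain Python (not Stata, and not the program I described). It assumes
missing values are coded as None and Sigma is a list of rows; the function
names are hypothetical. Grouping observations by their missingness pattern
is one possible "smart" shortcut, offered as an assumption rather than
something I have benchmarked: the Sigma_xx and Sigma_xy blocks are then
built once per pattern instead of once per observation.

```python
from collections import defaultdict

def observed_pattern(row):
    """Indices of the variables that are observed (not None) in one row."""
    return tuple(j for j, v in enumerate(row) if v is not None)

def group_by_pattern(data):
    """Map each missingness pattern to the list of row numbers sharing it."""
    groups = defaultdict(list)
    for i, row in enumerate(data):
        groups[observed_pattern(row)].append(i)
    return dict(groups)

def partition_sigma(sigma, obs_idx):
    """Keep only the rows/columns needed for one pattern:
    Sigma_xx (observed by observed) and Sigma_xy (observed by missing)."""
    mis_idx = [j for j in range(len(sigma)) if j not in obs_idx]
    s_xx = [[sigma[i][j] for j in obs_idx] for i in obs_idx]
    s_xy = [[sigma[i][j] for j in mis_idx] for i in obs_idx]
    return s_xx, s_xy

# Toy data: three variables, None marks a missing value.
data = [
    [1.0, None, 3.0],
    [2.0, None, 1.0],
    [None, 5.0, 0.5],
]
sigma = [
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.2],
    [0.5, 0.2, 2.0],
]

groups = group_by_pattern(data)
# One pair of blocks per distinct pattern, shared by all rows in that group.
blocks = {pat: partition_sigma(sigma, pat) for pat in groups}
```

With few distinct patterns this cuts the number of inversions a lot; with
a different pattern in nearly every row it saves nothing.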

BTW, you would have to do this observation-by-observation row/column
elimination and matrix inversion in the MI procedure anyway, so it is not
going to be blazingly fast, either. So unless StataCorp wants to
implement that as a core set of commands (yes, I know the concept is to
send everything off to ado-files, but this is a clear exception), the
admirers of the missing data routines might want to consider
alternative software.

Those are my two cents. Even though I am writing a dissertation on
missing data and the EM algorithm, I think StataCorp is wise enough to be
careful with missing data procedures -- I am used to the "if something
obvious is not implemented in Stata, then it is in fact statistically
wrong, or has so many limitations that the user should think twice, or
better a dozen times, while implementing it himself" ideology, and things
like MI, even though a long-desired addition, still look like an ad-hoc
patch unless you come from a scientific subgroup where it is beloved and
fully acceptable.

 ---                                    Stas Kolenikov
 --       Ph.D. student in Statistics at UNC-Chapel Hill
 - http://www.komkon.org/~tacik/  -- [email protected]

* This e-mail and all attachments to it are not intended to provide any
* reasonable point of view and were transmitted to you in error. They
* should be immediately deleted by all recipients unless they really
* enjoy communicating with the author :). Other restrictions apply.
