The Stata listserver

Re: st: missingness in a large, complex sampling design

From   "Stanislav Kolenikov" <[email protected]>
To   [email protected]
Subject   Re: st: missingness in a large, complex sampling design
Date   Mon, 16 Aug 2004 17:32:38 -0000

--- In [email protected], "Colleen Daly Martinez"
<colleendalymartinez@c...> wrote:
> In my analysis of data from a large (over 5,000) nationally
> representative sample study, which used complex sampling, I'm finding
> that a number of variables I'm examining have many missings- as many
> as 1,500 or more (I believe that they are non-respondents).
> I'm wondering if anyone has suggestions, or can point me to
> references which address the issue of managing this issue and the
> implications for my analysis.

How much this matters depends directly on the analysis you want to
perform on the data. Are you planning means and tabulations? Factor
analysis? Regression? Depending on the type of analysis, you may or
may not need some of the heavy machinery described in Little & Rubin's
or Schafer's books. One thing that you certainly need to know about
your data, though, is the distinction between MCAR, MAR and NMAR;
pick those up as quickly as you can if you have not seen them before!
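
As a rough illustration of why the MCAR / MAR / NMAR distinction
matters, here is a minimal Python simulation sketch. Everything in it
is an assumption made for illustration (the variable names, the
logistic response model, the sample size); it is not taken from any
particular survey. The covariate x is always observed, y is sometimes
missing, and the three mechanisms differ only in what the response
probability depends on: nothing (MCAR), the observed x (MAR), or the
possibly unobserved y itself (NMAR).

```python
import math
import random
import statistics


def expit(z):
    """Logistic function, used here as a response probability."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_mechanisms(n=20000, seed=1):
    """Mean of y among respondents under three missingness mechanisms."""
    rng = random.Random(seed)
    # x is fully observed; y is correlated with x and may go missing.
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]

    # MCAR: response probability is a constant, unrelated to the data.
    mcar = [yi for yi in y if rng.random() < 0.7]
    # MAR: response probability depends only on the observed x.
    mar = [yi for xi, yi in zip(x, y) if rng.random() < expit(xi)]
    # NMAR: response probability depends on y, which may be unobserved.
    nmar = [yi for yi in y if rng.random() < expit(yi)]

    return {
        "full": statistics.mean(y),
        "MCAR": statistics.mean(mcar),
        "MAR": statistics.mean(mar),
        "NMAR": statistics.mean(nmar),
    }


if __name__ == "__main__":
    for name, value in simulate_mechanisms().items():
        print(f"{name:5s} mean of observed y: {value: .3f}")
```

In this setup the complete-case mean should stay close to the full-data
mean only under MCAR. Under MAR it drifts, but since the missingness is
driven by the observed x, methods that condition on x can in principle
correct for it; under NMAR there is nothing observed to condition on.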

There are two main approaches to missing data currently on the
market: imputation of some kind, and integrating the missing data out.
In the first approach, you try to come up with some reasonable number
for the missing cell: in regression imputation, that is a linear
prediction given other variables; in hot-deck imputation, it is a
random pick from the same stratum; in multiple imputation, it is a
random draw from a joint distribution of the variables of interest,
repeated several times, with the results combined appropriately. In
the second approach, you try to write down the likelihood for the
complete data, see what it looks like for the available data (you
would need to take the expectation conditional on the observed data,
which is where the integration comes in), and see if this can be
maximized reasonably easily. It is not clear which of the approaches
makes stronger assumptions (and is thus less robust).
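
To make the imputation side concrete, here is a minimal Python sketch
of hot-deck imputation within strata, repeated several times and
combined with Rubin's rules (average the estimates; total variance =
average within-imputation variance + (1 + 1/m) times the
between-imputation variance). The function names `hot_deck`,
`rubin_combine` and `mi_mean` are hypothetical. Two loud caveats: the
variance of each completed-data mean uses the simple-random-sampling
formula, which ignores weights and clustering in a complex design, and
a plain repeated hot deck is only a crude stand-in for a proper
multiple imputation procedure. The sketch also assumes every stratum
has at least one observed donor.

```python
import random
import statistics


def hot_deck(values, strata, rng):
    """Fill each None with a random observed value from the same stratum."""
    donors = {}
    for v, s in zip(values, strata):
        if v is not None:
            donors.setdefault(s, []).append(v)
    # Assumes every stratum with a missing value has at least one donor.
    return [v if v is not None else rng.choice(donors[s])
            for v, s in zip(values, strata)]


def rubin_combine(estimates, variances):
    """Combine m completed-data estimates using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)          # pooled point estimate
    within = statistics.mean(variances)         # average within variance
    between = statistics.variance(estimates)    # uses the m - 1 denominator
    total = within + (1 + 1 / m) * between
    return q_bar, total


def mi_mean(values, strata, m=5, seed=0):
    """Estimate a mean from m hot-deck completed data sets."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = hot_deck(values, strata, rng)
        estimates.append(statistics.mean(completed))
        # SRS variance of the mean; ignores the survey design (assumption).
        variances.append(statistics.variance(completed) / len(completed))
    return rubin_combine(estimates, variances)


if __name__ == "__main__":
    y = [2.0, None, 3.0, 2.5, None, 7.0, 8.0, None, 6.5, 7.5]
    stratum = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]
    estimate, total_variance = mi_mean(y, stratum)
    print(estimate, total_variance)
```

The between-imputation term is what a single imputation leaves out: it
carries the extra uncertainty due to not knowing the missing values.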

Note that Stata does not have a multiple imputation procedure, and
lots of users have complained about this. The omission is not without
a reason, though, as is often the case with Stata: if something
obvious is not implemented, maybe it is not so obvious to begin with?
Just as there is no single bootstrap procedure that will work in 100%
of cases, and you need to think about things like the smoothness of
your distribution functional, dependencies in your data and pivoting
your statistic, you have to make a lot of substantive choices in
multiple imputation before it starts giving sensible results.

Theoretically, you can incorporate all of the missing data in a single
maximum likelihood procedure, and feed that into Stata's great -ml-
maximizer. I found this quite impractical in my simulation studies
four years ago, when PCs ran at under 1 GHz: a logistic regression
with missing covariates, 50 observations and 3 explanatory variables
took about half an hour to converge. I would expect some gains in
speed from the overall increase in computational power, plus a
marginal gain from the continuous improvement of Stata's code, but
those will hardly add up to a factor of ten. (I think these days I
might really benefit from Stata's plug-ins, but I have not tried them
so far.)
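
For a sense of what "integrating the missing data out" looks like in a
logistic regression, here is a minimal Python sketch of the
observed-data log likelihood under a deliberately simplified setup,
which is an assumption for illustration and not the model from the
simulations described above: a single binary covariate x with
P(x = 1) = pi, outcome y with P(y = 1 | x) = expit(b0 + b1*x), and
missingness in x that can be ignored. When x is observed, the
contribution is the joint probability of (y, x); when x is missing, it
is that joint probability summed over x = 0 and x = 1. The function
names are hypothetical.

```python
import math


def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def bernoulli_pmf(value, p):
    """P(value) for a 0/1 variable with success probability p."""
    return p if value == 1 else 1.0 - p


def observed_loglik(params, data):
    """Observed-data log likelihood; x is None where it is missing."""
    b0, b1, pi = params
    total = 0.0
    for y, x in data:
        if x is not None:
            # Complete case: joint probability of the covariate and outcome.
            lik = bernoulli_pmf(x, pi) * bernoulli_pmf(y, expit(b0 + b1 * x))
        else:
            # Missing covariate: sum the joint probability over x = 0, 1.
            lik = sum(
                bernoulli_pmf(xv, pi) * bernoulli_pmf(y, expit(b0 + b1 * xv))
                for xv in (0, 1)
            )
        total += math.log(lik)
    return total


if __name__ == "__main__":
    sample = [(1, 1), (0, 0), (1, None), (0, 1), (1, None), (0, 0)]
    print(observed_loglik((0.0, 1.0, 0.5), sample))
```

This function would then be handed to a numerical maximizer. With a
binary covariate the "integral" is just a two-term sum; with continuous
or several missing covariates it becomes a numerical integration inside
every likelihood evaluation, which is plausibly where much of the
computing time goes.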

