


Re: st: Missingness

From   Maarten Buis
Subject   Re: st: Missingness
Date   Tue, 28 Aug 2012 10:53:03 +0200

On Tue, Aug 28, 2012 at 9:42 AM, Brendan Churchill wrote:
> I am using some ordinal variables, which have some numeric missing values, in a multilevel model. In some previous research, I have seen researchers include a 'Missing' independent variable in their model to account for some of the 'missingness' - or rather to control for the missing values, but I don't quite understand how to do it in Stata or even if that's a good way to do it. I've tried to make a binary variable in which the missing values are coded 1 and the rest of the values are coded 0 but the model rejects this because it's collinear.
> Is this how you do it? Or is there a variable for the entire data set that is created to account for all missing variables?

The most common "method" is to ignore all observations with at least
one missing value. This is fine as long as the probability of
missingness is not related to the dependent/explained/left-hand-side/y
variable. In that case, the estimates will still be consistent; you
just lose power. Since you have missing values on the dependent
variable this means that the probability of missingness needs to be
unrelated to the unobserved values on that dependent variable. There
is obviously no way to check that, but often you have a reasonable
idea how the missing values came to be and you can use that to make it
plausible that this is so. The safest method is indeed to just ignore
the observations with missing values, as long as you can make a
plausible case that the probability of missingness is not related to
the unobserved missing values of the dependent variable (possibly
after controlling for any other variable in your model).
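
To make the point above concrete, here is a minimal simulation sketch.
It is written in Python rather than Stata, and everything in it is an
assumption made up for illustration: the model y = 1 + 2x + e, the
sample size, and the missingness probabilities. Missingness in y
depends only on x, not on y itself once x is controlled for, so the
slope estimated from the complete cases should stay close to the true
value of 2, just with fewer observations.

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def complete_case_slope(n=20000, true_slope=2.0, seed=12345):
    """Simulate data, delete some y values, and fit on what is left.

    All settings here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    kept_x, kept_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 1.0 + true_slope * x + rng.gauss(0, 1)
        # Missingness depends on x only, not on y itself given x.
        p_missing = 0.6 if x > 0 else 0.2
        if rng.random() >= p_missing:  # this observation is complete
            kept_x.append(x)
            kept_y.append(y)
    return ols_slope(kept_x, kept_y), len(kept_x)


slope, n_kept = complete_case_slope()
print(f"complete cases: {n_kept} of 20000, estimated slope: {slope:.3f}")
```

The cost of dropping the incomplete observations shows up in the
smaller number of complete cases, not in a systematic shift of the
slope.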

When you believe that the probability of missingness is (strongly)
dependent on the unobserved missing values (even after controlling for
all other variables in your model) then you are in a lot of trouble.
In essence your data does not have the information necessary to
estimate what you want, and no amount of statistical trickery can
create information that is not present in the data. Methods that claim
to deal with these situations just replace information from the data
with "information" from (often untestable) assumptions and the results
from these methods rest rather heavily on the correctness of these
assumptions. In those cases I think it is safer to remember John Tukey
(1986, pp. 74-75): "The combination of some data and an aching desire
for an answer does not ensure that a reasonable answer can be
extracted from a given body of data."
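
A small simulation sketch of this problem case, again in Python rather
than Stata and with made-up settings (the model y = 1 + 2x + e, the
sample size, and the missingness probabilities are all assumptions for
illustration). Here high values of y are much more likely to be
missing than low values, so missingness depends on the unobserved
value itself. Under these settings the complete-case slope should come
out noticeably below the true value of 2, and nothing in the observed
data alone tells you by how much.

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slope_when_missingness_depends_on_y(n=20000, true_slope=2.0, seed=2012):
    """Simulate data where y goes missing depending on its own value.

    All settings here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    kept_x, kept_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 1.0 + true_slope * x + rng.gauss(0, 1)
        # Missingness depends on the value of y that would go unobserved.
        p_missing = 0.9 if y > 1.0 else 0.1
        if rng.random() >= p_missing:  # this observation is complete
            kept_x.append(x)
            kept_y.append(y)
    return ols_slope(kept_x, kept_y), len(kept_x)


biased_slope, n_observed = slope_when_missingness_depends_on_y()
print(f"complete cases: {n_observed} of 20000, "
      f"estimated slope: {biased_slope:.3f} (true slope is 2)")
```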

The method you proposed does not work for dependent variables. For
independent/explanatory/right-hand-side/x-variables you need to be
very careful: this method only makes sense when a missing value means
"the value does not exist" rather than "the value exists but has not
been observed". See:

Hope this helps,

John Tukey (1986), "Sunset salvo". The American Statistician 40(1):72-76.

Maarten L. Buis
Reichpietschufer 50
10785 Berlin
