
st: Multiple Imputation and other Missing Data

From   "Groves" <>
To   "stata" <>
Subject   st: Multiple Imputation and other Missing Data
Date   Wed, 23 Nov 2005 19:53:52 -0600

Dear list,

I'm hoping that one or more of you can take a minute or two to skim through
the following several paragraphs and provide some feedback to someone
marginally familiar with statistics in general (and missing data techniques
in particular) about whether these paragraphs make any sense and whether
they provide any indication about the knowledge level of their author.

A student wrote these paragraphs in an attempt to summarize the methods he
used to conduct a longitudinal analysis purportedly employing some type of
advanced missing data technique. These paragraphs were intended for
publication in a jointly authored manuscript, hence my desire to make
certain that they are reasonably accurate.

From my read, however, I am unable to determine what type of missing data
technique the student is claiming to have used and more generally, whether
any of these explanations even make sense. Unfortunately, although I can
sense that something is wrong, I lack the experience to put my finger on
the exact problem. Succinctly, would you advise for or against placing
one's name as a co-author on a paper containing the following paragraphs
(even after a reasonable degree of copy-editing)? Any feedback (on or
off-list) would be greatly appreciated. I've extracted the sections that
appear most relevant. The section entitled "Missing Data," however, is
nearly complete as provided by the student and was intended to be a full
description of relevant missing data issues and an explanation for the
exact techniques used in the present research.

Thanks in advance for your comments.

Data Collection

The data used in this study were collected starting in 19** from a
target population made up of the seventh graders enrolled in a random half
of all the junior high schools in the **** School District. These
adolescents were surveyed again in 19** and in 19**.

The selection criteria needed for the subjects to be included in the
sample were that they provided data in both the Time 1 and Time 2 data
collection waves ... .

Missing Data

Ignorable missing data are usually the product of one of two
mechanisms: missing completely at random (MCAR) and missing at random
(MAR). Data are MCAR when a subject's nonresponse to a question is not
dependent on any other measured or observed variable related to the
subject, study, or the question itself. If a subject's nonresponse to a
given question is contingent on subject characteristics or a previous
response, but not dependent on the question itself, then the data are
considered MAR (Rubin 1976, Enders and Bandalos 2001). It should be
evident that MCAR is the stronger assumption, because data that are MCAR
are also MAR.
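
For list members who want to see the MCAR/MAR distinction in concrete terms, here is a minimal Python sketch. It is not part of the student's analysis: the simulated variables x and y, the 30% MCAR rate, and the 50%/10% MAR rates are all assumptions chosen purely for illustration. Under MCAR the observed cases are a random subsample, so their mean stays close to the full-sample mean; under MAR driven by x, the complete-case mean of y drifts unless x is taken into account.

```python
import random

def simulate_missingness(n=10000, seed=0):
    """Generate (x, y) pairs plus two hypothetical missingness flags for y.

    MCAR: y is missing with a fixed probability, unrelated to x or y.
    MAR:  y is missing with a probability that depends only on observed x.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 0.5 * x + rng.gauss(0, 1)
        miss_mcar = rng.random() < 0.3                    # same rate for everyone
        miss_mar = rng.random() < (0.5 if x > 0 else 0.1) # rate depends on x only
        rows.append((x, y, miss_mcar, miss_mar))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rows = simulate_missingness()
full_mean = mean(y for _, y, _, _ in rows)           # all cases, no missingness
mcar_mean = mean(y for _, y, m, _ in rows if not m)  # complete cases under MCAR
mar_mean = mean(y for _, y, _, m in rows if not m)   # complete cases under MAR

print(f"full: {full_mean:.3f}  MCAR: {mcar_mean:.3f}  MAR: {mar_mean:.3f}")
```

In this setup, cases with high x (and therefore high y on average) are more likely to be missing under the MAR flag, so the complete-case mean under MAR falls below the full-sample mean.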

Missing data in the variables reported here are assumed to be the less
restrictive MAR type; however, given the nature of the variables, it is
possible that subjects' responses might not even meet MAR
assumptions. There are several methods for addressing missing data. Such
methods include theory-based direct maximum likelihood or full information
maximum likelihood (FIML), listwise and pairwise deletion, and different
forms of multiple imputation. In general, the majority of recent research
into the efficiency of missing data methods has shown that direct maximum
likelihood techniques outperform all other methods (Enders 2001, Little and
Rubin 1987).

One drawback of the direct ML method is that it assumes multivariate
normality similar to all ML estimation methods. However, little is known
about how these methods work in the presence of nonnormal data and/or
clustered data such as used for the study. If it follows other ML
estimation techniques, then most likely parameters will be increasingly
biased as the degree of nonnormality and clustering increase. One form of
imputation called the similar response pattern method has been implemented
in PRELIS 2, which is a preprocessor for the LISREL program (Joreskog and
Sorbom 1996).

The method attempts to impute real values from another case with
similar observed values by using a minimization routine based on a set of
matching variables. If the routine cannot find a case with complete data
using the matching variables then the missing value for that variable is
not imputed into the case and remains missing. A study by Brown (1994)
found that compared to listwise and pairwise deletion, mean imputation, and
hot-deck imputation, similar response pattern imputation produced the least
bias overall in regard to structural and measurement model
parameters. However, he did find that there was some positive bias in the
error estimates indicating that Type 1 error rates would be larger than

Although there is no statistical theory that would support this method
over direct missing data methods, the fact that it imputes values from
similar cases is attractive, because of the clustered nature of the second
generation data. If it is plausible that children from the same family
would have more similar responses to each other than to children from other
families then possibly imputing a value from a respondent's sibling does
have some validity. As suggested in the PRELIS manual a large number of
matching variables were used, including subject identification numbers,
that were not otherwise used to select the subjects or used in any of the
model estimations as moderators, indicators, or other variables.
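
To give readers a rough picture of what donor-based imputation of this kind does, here is a minimal Python sketch. It is an assumption-laden illustration, not the PRELIS 2 routine and not the student's code: the function name, the variable names, and the use of an unstandardized sum of squared differences are all hypothetical choices. It copies the value from the complete case that is closest on the matching variables and leaves the value missing when no match is possible, as the paragraphs above describe.

```python
def impute_from_similar_case(cases, target, matching):
    """Fill missing (None) values of `target` from the closest complete donor.

    cases    : list of dicts, one per respondent
    target   : name of the variable to impute
    matching : names of the matching variables
    Returns a new list; a value stays None when no donor can be matched.
    """
    # Donors must be observed on the target and on every matching variable.
    donors = [c for c in cases
              if c[target] is not None
              and all(c[m] is not None for m in matching)]
    result = []
    for case in cases:
        filled = dict(case)  # copy so the input list is not modified
        can_match = all(case[m] is not None for m in matching)
        if case[target] is None and can_match and donors:
            # Pick the donor with the smallest sum of squared differences.
            nearest = min(
                donors,
                key=lambda d: sum((d[m] - case[m]) ** 2 for m in matching),
            )
            filled[target] = nearest[target]
        result.append(filled)
    return result

# Hypothetical toy data: "y" is the variable with missing values.
cases = [
    {"age": 12, "score": 3.0, "y": 10.0},
    {"age": 15, "score": 1.0, "y": 20.0},
    {"age": 12, "score": 2.5, "y": None},    # closest donor is the first case
    {"age": None, "score": 2.0, "y": None},  # cannot be matched, stays missing
]
imputed = impute_from_similar_case(cases, "y", ["age", "score"])
```

Because the distance is computed on whatever matching variables are supplied, the result depends entirely on that choice, so the selection of matching variables is worth scrutinizing.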
