


Re: st: Multiple Imputation in Longitudinal Multilevel Model

From   Anthony Fulginiti <>
Subject   Re: st: Multiple Imputation in Longitudinal Multilevel Model
Date   Wed, 6 Mar 2013 09:47:59 -0800 (PST)

Thank you Jay and Stas for such helpful feedback!  

For clarification, read1-read3 represent data at 3 different time points for an individual. However, that was code from the referenced UCLA document, which I used as a template for my own coding (using the variables in the imputation and xtmixed models below). In any case, I simplified the model for the post but followed your suggestions and used a host of variables (known in the literature to relate to the outcome variable) in both the imputation model and the xtmixed model.

My imputation model code looked like:

ice bprstot0 isetot0 age0 total_ip_episodes0 length_illness0  minority_dummy0 female_dummy0 bprstot1 isetot1 bprstot2 isetot2 bprstot3 isetot3 bprstot4 isetot4 bprstot5 isetot5 bprstot6 isetot6, saving(imputed_dataset) m(4)

My xtmixed model code is:

xtmixed Self_Esteem c_time c_time2 EthnicityByc_time EthnicityBytime2 bprstot length_illness || id: c_time, covariance(un) variance mle

To your earlier points:

1) I will increase the number of imputations.

2) I did not include time in my imputation model because my understanding was that the imputation should be performed in wide format rather than long format (when reshaping from long to wide, the time indicator variable is dropped until I subsequently reshape back for use with the mim command). Perhaps if I follow your suggestions and use the built-in -mi- features, I can incorporate time and its quadratic, since I found that a polynomial function best fits the existing data.
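If I do move to -mi-, my guess at the wide-format workflow is something like the sketch below (the variable list mirrors my -ice- call above; the add() count and rseed() value are placeholders, and I am assuming the baseline covariates after the equals sign are complete):

mi set wide
mi register imputed bprstot0 isetot0 bprstot1 isetot1 bprstot2 isetot2 bprstot3 isetot3 bprstot4 isetot4 bprstot5 isetot5 bprstot6 isetot6
mi impute chained (regress) bprstot0 isetot0 bprstot1 isetot1 bprstot2 isetot2 bprstot3 isetot3 bprstot4 isetot4 bprstot5 isetot5 bprstot6 isetot6 = age0 female_dummy0 minority_dummy0 length_illness0 total_ip_episodes0, add(50) rseed(12345)
mi reshape long bprstot isetot, i(id) j(time)

My understanding is that -mi reshape- carries the imputation bookkeeping through the reshape, which would avoid the wide/long problem above.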

*A fundamental issue that I was trying to address is that I know that certain variables, such as symptoms, relate to the outcome. However, I have missing values in several of those relevant independent variables as well as in the outcome variable. My understanding was that the chained equation approach (with ice and then the mim command) would work well when there are missing values in both the independent and dependent variables.

I will further review the MI manual but let me know if you have any additional thoughts or helpful hints on the matter. I appreciate the insight and perspective on such challenging material.  

Sincerely, Anthony

On Wed, Mar 6, 2013 at 9:32 AM, Stas Kolenikov <> wrote:
1. Are read1-read3 and math1-math3 three measurements taken at the
same time for a given individual, or measurements taken over three
periods? If the former, then your model is "flat", as it does not
recognize and utilize the longitudinal/multilevel nature of the data.

Yes, you need to put that in, which can be quite challenging. Usually
you need to add independent variables that capture the time and
panel trend aspects. If you can afford to add dummies for each
group (i.e., fixed effects), it's worth it; and for the time structure,
linear, quadratic and cubic terms, or some kind of regression spline
structure, are also worth considering.
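Concretely, something along these lines (just a sketch using your variable names; the dummy route is only feasible with a modest number of groups):

gen c_time2 = c_time^2
gen c_time3 = c_time^3
quietly tabulate id, generate(id_dum)

The id_dum* variables are the group fixed effects, and the polynomial terms can then enter both the imputation and analysis models.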

2. Once you've done -ice-, don't touch anything (let alone anything as
drastic as -drop if _mj==0-), and use -mi estimate:- for everything. I
don't really know how well either -mi- or -ice- goes with -reshape-, but
I suspect that if not done properly, it will screw up the delicate
mechanics of -mi-.

And given that you can use chained equations in MI, I'd really suggest
doing things with MI directly, not -ice-. Nothing bad about -ice-, but
being able to run entirely in MI is likely to be much easier.
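If you do keep the -ice- output, you can also fold it into the -mi- machinery instead of re-imputing; roughly (check [MI] mi import ice for the details, and [MI] estimation for whether your estimator is supported):

use imputed_dataset, clear
mi import ice, automatic
mi estimate: xtmixed Self_Esteem c_time bprstot length_illness || id: c_time, covariance(un) mle

The automatic option registers the imputed variables based on the _mj/_mi bookkeeping that -ice- leaves behind.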

3. I agree with Jay that 4 imputations are woefully insufficient. I
have heard the arguments that you don't see much Monte Carlo
variability beyond 5 imputations, but I can put two arguments in favor
of a much greater number, like M=50: first, you don't explore the
multivariate space of missing data enough (M=5 may be OK for a
univariate mean, but I can't see how it can work for a 30-dimensional
space), and second, I want my minimum degrees of freedom to be greater
than the nominal sample size, so that the limitation on the accuracy
really comes from the data rather than the computer.

The original argument came from Don Rubin doing some calculations on
univariate means and OLS regression coefficients. It really doesn't
extend past that. Kenward & Carpenter did some work suggesting
that you should have many more imputations. This is discussed in the
MI manual, p. 5, with citations. But it depends on what you want to
know: for a univariate mean it's no big deal and a small number of
imputations will do, whereas if you're doing logistic regression on
relatively rare events, you need many more.
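For reference, Rubin's efficiency calculation says that M imputations have relative efficiency of roughly (1 + gamma/M)^(-1), where gamma is the fraction of missing information, while the MI degrees of freedom are

nu = (M - 1) * (1 + 1/r)^2

with r the relative increase in variance due to nonresponse. The efficiency looks fine at M=5 even for sizable gamma, but nu can be tiny when r is large, which is exactly the degrees-of-freedom concern I raised above.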

4. If you are bringing additional variables to the -xtmixed- model,
you would probably have been better off using these variables in
imputation. You had a reason to believe that they affected the
response, and for that same reason they should be in the imputation
model.

I'll go one step further: the imputation model needs to be more
comprehensive than the analysis model.

--- On Tue, 3/5/13, Anthony Fulginiti <> wrote:

> From: Anthony Fulginiti <>
> Subject: st: Multiple Imputation in Longitudinal Multilevel Model
> To:
> Date: Tuesday, March 5, 2013, 10:54 PM
> Dear Statalist,
> I have been trying to better understand multiple imputation
> in the context of longitudinal multilevel modeling in Stata
> using Stata 12.1.  Based on my review of Stata
> documentation on multiple imputation, it seems as though a
> common choice for multiple imputation when the pattern of
> missing data is arbitrary rather than monotonic is the
> -ice- command with the subsequent -mim- command. I was
> consulting an FAQ document from UCLA
> (, which
> provides an example of a coding strategy for MI in
> longitudinal data as follows:
> reshape wide read math, i(id) j(time)
> set seed 091107
> ice female private ses read1 read2 read3 math1 math2 math3,
> saving(imputed_dataset) m(4)
> use imputed_dataset, clear 
> tab _mj
> sum female private ses read1 read2 read3 math1 math2 math3
> drop if _mj==0
> reshape long read math, i(id _mj)
> sum female private ses read math
> rename _j time
> mim: xtreg read math time, i(id) 
> xtreg read math time if _mj==1, i(id)
> My dataset has 7 datapoints and the code works fine. 
> However, I have 2 questions:
> 1) So now that I have the 4 datasets with multiply imputed
> values, do I have to take an additional step for model
> testing or when I run the models with the mim prefix, is
> that using the information derived from all of the multiply
> imputed datasets?
> 2) My understanding is that the xtreg command is only used
> for random intercept models.  However, if I am running
> a growth curve model with not only a random intercept but
> also a random slope for time, is there anything fundamentally
> flawed with using the xtmixed command, which I have
> typically used for performing growth curve analysis with a
> multilevel model?  When I run it, there are no error
> messages, but that doesn't mean there aren't errors in my
> logic/approach/understanding, so I wanted to seek your
> feedback.
> My code looks like:
> mim: xtmixed Self_Esteem c_time Ethnicity EthnicityBytime||
> id: c_time, covariance(un) variance mle
> My apologies for the lengthy email.  Please let me know
> if the addition of output would be of any help in offering
> advice.  I thank you in advance for your feedback.
> Respectfully Yours, Anthony

