How can I account for clustering when creating imputations with mi impute?
Title: Accounting for clustering with mi impute
Authors: Wesley Eddings and Yulia Marchenko, StataCorp
Date: August 2010; updated July 2011
Note: This frequently asked question (FAQ)
assumes familiarity with multiple imputation. Please see
the documentation entries [MI] intro substantive and [MI]
intro if you are unfamiliar with the method. Also, if your data have
already been imputed, see the documentation entry [MI] mi import for
how to import your data into mi, and see [MI] mi estimate for how to
analyze your multiply imputed data.
As of Stata 11.1, the mi estimate command can be used to analyze multiply
imputed clustered (panel or longitudinal) data by fitting several
clustered-data models, such as xtreg, xtlogit, and xtmixed; see
mi estimation for the full list.
However, we must also account for clustering when creating
multiply imputed data; this FAQ will show how.
We can create multiply imputed data with mi impute, Stata’s official
command for imputing missing values. There is no
definitive recommendation in the literature on the best way to impute
clustered data, but three strategies have been suggested:
- Include indicator variables for clusters in the imputation model.
- Impute data separately for each cluster.
- Use a multivariate normal model to impute all clusters simultaneously.
We will explain how to carry out each strategy with mi impute.
We will assume for now that we have data in long form and
that only one variable has missing values; extensions to more than one
imputed variable will be described later.
Strategy 1: Include indicator variables for clusters in the imputation model
If there are not too many clusters, we can account for clustering by
including cluster indicators in our imputation model. The
factor-variable syntax of Stata makes it easy to include the indicators
with mi impute: we do not even have to generate any new variables.
Our first example dataset, data1.dta, has 40 observations within each of 10
clusters; the variable id indexes observations within clusters. Ten
percent of the observations have missing values for the observation-level
predictor x; no values of the response y are missing. We want
to study the association between y and the partially observed
predictor x while accounting for the association within clusters.
. use data1, clear
. describe
Contains data from data1.dta
obs: 400
vars: 4 29 Jul 2010 14:56
size: 9,600
--------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------
cluster float %9.0g
id float %9.0g
y double %10.0g
x double %10.0g
--------------------------------------------------------------------------------
Sorted by:
. sort cluster id
. by cluster: summarize y x
--------------------------------------------------------------------------------
-> cluster = 1
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
y | 40 97.49968 33.62784 23.14309 160.8217
x | 38 30.00173 7.943642 12.95944 42.72091
--------------------------------------------------------------------------------
-> cluster = 2
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
y | 40 100.2756 31.70555 20.78498 151.5145
x | 39 30.77486 8.020621 5.549631 44.48839
--------------------------------------------------------------------------------
-> cluster = 3
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
y | 40 147.5954 37.44895 71.18038 217.404
x | 38 31.85539 8.794805 16.45632 49.47706
--------------------------------------------------------------------------------
...
(output omitted)
We impute the missing values of x with mi impute regress, a
Gaussian regression imputation method. We account for clustering by
including in our imputation model the factor variable i.cluster. The
response y should also be included as a predictor:
. mi set wide
. mi register imputed x
. mi impute regress x y i.cluster, add(5) noisily
Running regress on observed data:
Source | SS df MS Number of obs = 360
-------------+------------------------------ F( 10, 349) = 32.74
Model | 11088.9434 10 1108.89434 Prob > F = 0.0000
Residual | 11821.207 349 33.8716533 R-squared = 0.4840
-------------+------------------------------ Adj R-squared = 0.4692
Total | 22910.1504 359 63.8165749 Root MSE = 5.8199
------------------------------------------------------------------------------
x | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
y | .1572187 .0088668 17.73 0.000 .1397797 .1746578
|
cluster |
2 | .5249299 1.326672 0.40 0.693 -2.084348 3.134208
3 | -6.21639 1.410625 -4.41 0.000 -8.990786 -3.441994
4 | -1.153281 1.364677 -0.85 0.399 -3.837306 1.530743
5 | .6848743 1.387169 0.49 0.622 -2.043388 3.413136
6 | -4.79826 1.409348 -3.40 0.001 -7.570143 -2.026376
7 | -1.828347 1.34203 -1.36 0.174 -4.46783 .8111363
8 | -1.427531 1.349231 -1.06 0.291 -4.081178 1.226117
9 | 1.565089 1.353659 1.16 0.248 -1.097267 4.227444
10 | -2.067867 1.384883 -1.49 0.136 -4.791633 .6558993
|
_cons | 14.49285 1.287011 11.26 0.000 11.96157 17.02412
------------------------------------------------------------------------------
Univariate imputation Imputations = 5
Linear regression added = 5
Imputed: m=1 through m=5 updated = 0
------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
x | 360 40 40 | 400
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
of the number of filled-in observations.)
We used the noisily option of mi impute to display the
intermediate regression output, which shows that nine dummy variables were
properly included for the ten clusters. We now fit our analysis model by
using, for example, xtmixed with the mi estimate: prefix:
. mi estimate: xtmixed y x || cluster:
Multiple-imputation estimates Imputations = 5
Mixed-effects ML regression Number of obs = 400
Group variable: cluster Number of groups = 10
Obs per group: min = 40
avg = 40.0
max = 40
Average RVI = 0.0647
Largest FMI = 0.1183
DF adjustment: Large sample DF: min = 314.63
avg = 36515.95
max = 144591.39
Model F test: Equal FMI F( 1, 314.6) = 305.58
Prob > F = 0.0000
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 2.981427 .1705541 17.48 0.000 2.645856 3.316998
_cons | 20.47604 7.020516 2.92 0.004 6.695679 34.2564
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
cluster: Identity |
sd(_cons) | 14.12998 3.418201 8.794777 22.70171
-----------------------------+------------------------------------------------
sd(Residual) | 25.10148 .9518553 23.29734 27.04534
------------------------------------------------------------------------------
The coefficient of x is estimated to be about 3 with a standard error of
about 0.2, and the cluster-level intercepts have a mean of about 20 with a
standard deviation of about 14. Had we not included the cluster variable in
our imputation model, we would have obtained a smaller estimate of the
variance component for clusters.
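To check this comparison yourself, a minimal sketch follows. It assumes the same data1.dta and refits the analysis model after imputing x without the cluster indicators. The seed value is arbitrary and not from the original example, so results will differ from run to run with other seeds.

```stata
* Sketch only: impute x WITHOUT cluster indicators, for comparison
* with the strategy 1 results shown above.
use data1, clear
mi set wide
mi register imputed x

* Imputation model omits i.cluster; rseed() value is an arbitrary choice
mi impute regress x y, add(5) rseed(2010)

* Same analysis model as before; compare sd(_cons) with the earlier fit
mi estimate: xtmixed y x || cluster:
```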
Graham (2009) suggests that cluster indicators can work well for as
many as 35 indicator variables. Strategy 1 is best suited for data
with few clusters and many observations within each cluster.
Strategy 2: Impute data separately for each cluster
By including clusters as indicator variables in our imputation model (strategy 1), we allow the regression function of the
imputed variable to vary by cluster. More generally, we can allow the
distributions of the imputed values to differ among clusters by imputing
each cluster separately (Graham 2009). As of Stata 12, we can do this with
the by() option of mi impute.
Our second example dataset, data2.dta, like the first, includes a
response variable y with no missing values and a predictor x with 10%
missing values. We have 50 observations within each of 20 clusters. We will
impute each cluster separately and then fit an analysis model with xtmixed.
. use data2.dta, clear
. mi set wide
. mi register imputed x
. mi impute regress x y, add(5) by(cluster, noreport)
Univariate imputation Imputations = 5
Linear regression added = 5
Imputed: m=1 through m=5 updated = 0
------------------------------------------------------------------
| Observations per m
by() |----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
cluster = 1 | |
x | 44 6 6 | 50
| |
cluster = 2 | |
x | 47 3 3 | 50
...
(output omitted)
cluster = 19 | |
x | 45 5 5 | 50
| |
cluster = 20 | |
x | 46 4 4 | 50
| |
-------------------+-----------------------------------+----------
Overall | |
x | 900 100 100 | 1000
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
of the number of filled-in observations.)
We now fit mi estimate: xtmixed to our multiply imputed data:
. mi estimate: xtmixed y x || cluster:
Multiple-imputation estimates Imputations = 5
Mixed-effects ML regression Number of obs = 1000
Group variable: cluster Number of groups = 20
Obs per group: min = 50
avg = 50.0
max = 50
Average RVI = 0.0572
Largest FMI = 0.1076
DF adjustment: Large sample DF: min = 377.90
avg = 69126.54
max = 162018.23
Model F test: Equal FMI F( 1, 440.0) = 1520.59
Prob > F = 0.0000
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 8.143506 .2088359 38.99 0.000 7.733066 8.553946
_cons | 19.31807 6.28803 3.07 0.002 6.993662 31.64247
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
cluster: Identity |
sd(_cons) | 26.28614 4.279303 19.10523 36.16607
-----------------------------+------------------------------------------------
sd(Residual) | 30.27901 .7220861 28.89197 31.73263
------------------------------------------------------------------------------
The coefficient for x is about 8 with a standard error of about 0.2,
and the intraclass correlation is about 26^2/(26^2 + 30^2) = 0.43.
The intraclass correlation ranges from zero to one,
and larger values mean that the clustering variable is more informative.
Imputing each cluster separately requires a sufficient number of
observations in each cluster.
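The intraclass correlation quoted above can be reproduced from the unrounded estimates. The sketch below simply copies sd(_cons) and sd(Residual) from the mi estimate output into scalars; the scalar names are illustrative, and the displayed value should be about 0.43.

```stata
* Intraclass correlation from the reported standard deviations
* (values copied by hand from the mi estimate: xtmixed output above)
scalar sd_cluster  = 26.28614
scalar sd_residual = 30.27901

* ICC = between-cluster variance / (between-cluster + residual variance)
scalar icc = sd_cluster^2 / (sd_cluster^2 + sd_residual^2)
display "ICC = " %5.3f icc
```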
Strategy 3: Use a multivariate normal model to impute all clusters simultaneously
A third way to account for within-cluster correlation is to impute jointly
over clusters using a multivariate normal model. Observations within clusters
may be viewed as a sample from a multivariate normal distribution with
an unrestricted covariance structure. The multivariate normal
strategy works well when there are only a few observations in each cluster
(Allison 2002). There is a limitation to this strategy: it
is best suited to balanced repeated-measures data.
We will illustrate the multivariate normal strategy with a new balanced
dataset. It has 50 clusters but only 5 observations within each
cluster. (Such data might occur, for example, in a repeated-measures study
of subjects’ test scores.) We would once again like to impute missing
values of x and then fit a linear mixed-effects model with
xtmixed.
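Because the multivariate normal strategy relies on balanced data, it can help to confirm the balance before reshaping. The following is a minimal sketch, assuming the long-form example dataset used below (data3.dta, with variables cluster and id); the helper variable nobs is introduced here only for illustration.

```stata
* Sketch: confirm the long-form data are balanced before reshaping wide
use data3, clear

* cluster and id together should uniquely identify each observation;
* isid exits with an error if they do not
isid cluster id

* Count observations per cluster; balanced data show a single value
bysort cluster: generate nobs = _N
tabulate nobs
drop nobs
```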
Before we can fit the multivariate normal imputation model, we will need
to reshape our data to wide form so that each cluster occupies a single
row. The variable id indexes observations within clusters.
. use data3, clear
. reshape wide x y, i(cluster) j(id)
(note: j = 1 2 3 4 5)
Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 250 -> 50
Number of variables 4 -> 11
j variable (5 values) id -> (dropped)
xij variables:
x -> x1 x2 ... x5
y -> y1 y2 ... y5
-----------------------------------------------------------------------------
We can now impute with mi impute mvn, and the
multivariate normal regression model will allow interdependencies
within clusters.
. mi set wide
. mi register imputed x1 x2 x3 x4 x5
. mi impute mvn x1 x2 x3 x4 x5 = y1 y2 y3 y4 y5, add(5)
Performing EM optimization:
observed log likelihood = -296.02862 at iteration 16
Performing MCMC data augmentation ...
Multivariate imputation Imputations = 5
Multivariate normal regression added = 5
Imputed: m=1 through m=5 updated = 0
Prior: uniform Iterations = 500
burn-in = 100
between = 100
------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
x1 | 42 8 8 | 50
x2 | 43 7 7 | 50
x3 | 46 4 4 | 50
x4 | 47 3 3 | 50
x5 | 47 3 3 | 50
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
of the number of filled-in observations.)
To use mi estimate: xtmixed, we need to reshape our data back to long
form. With mi data, we need to use the mi reshape command to
do this:
. mi reshape long x y, i(cluster) j(id)
reshaping m=0 data ...
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 50 -> 250
Number of variables 11 -> 4
j variable (5 values) -> id
xij variables:
x1 x2 ... x5 -> x
y1 y2 ... y5 -> y
-----------------------------------------------------------------------------
reshaping m=1 data ...
reshaping m=2 data ...
reshaping m=3 data ...
reshaping m=4 data ...
reshaping m=5 data ...
assembling results ...
We are now ready to use mi estimate: xtmixed:
. mi estimate: xtmixed y x || cluster:
Multiple-imputation estimates Imputations = 5
Mixed-effects ML regression Number of obs = 250
Group variable: cluster Number of groups = 50
Obs per group: min = 5
avg = 5.0
max = 5
Average RVI = 0.0256
Largest FMI = 0.0775
DF adjustment: Large sample DF: min = 713.36
avg = 27749.21
max = 75833.24
Model F test: Equal FMI F( 1, 713.4) = 54.50
Prob > F = 0.0000
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .8719378 .1181111 7.38 0.000 .6400508 1.103825
_cons | 2.627733 2.045674 1.28 0.199 -1.382447 6.637914
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
cluster: Identity |
sd(_cons) | 11.42131 1.191664 9.308991 14.01293
-----------------------------+------------------------------------------------
sd(Residual) | 5.05675 .2543776 4.58195 5.580752
------------------------------------------------------------------------------
Conclusion
All three strategies can be modified to impute more than one variable. The
indicator-variable and separate-imputation strategies, strategies 1 and 2,
require a multivariate imputation method such as mi impute monotone or mi
impute mvn in place of a univariate method such as mi impute
regress. The multivariate normal strategy, strategy 3, can be extended
by adding extra variables to the left-hand side of the equation in mi impute
mvn. If we wanted to impute x and another variable z, the
commands might look like this:
. reshape wide x y z, i(cluster) j(id)
. mi set wide
. mi register imputed x1 x2 x3 x4 x5 z1 z2 z3 z4 z5
. mi impute mvn x1 x2 x3 x4 x5 z1 z2 z3 z4 z5 = y1 y2 y3 y4 y5, add(5)
. mi reshape long x y z, i(cluster) j(id)
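For strategies 1 and 2, a comparable sketch with mi impute monotone might look like the following. It assumes long-form data, treats z as a hypothetical second incomplete variable, and assumes the missing-data pattern of x and z is monotone; none of this comes from the example datasets above.

```stata
* Sketch only: z is a hypothetical second incomplete variable
mi set wide
mi register imputed x z

* Check that the missing-value pattern is nested (monotone)
mi misstable nested x z

* Strategy 1: cluster indicators in a multivariate imputation model
mi impute monotone (regress) x z = y i.cluster, add(5)

* Strategy 2 (alternative to the line above, Stata 12 or later):
* impute separately within each cluster
* mi impute monotone (regress) x z = y, add(5) by(cluster)
```

Only one of the two mi impute commands should be run on a given dataset; running both would add ten imputations rather than five.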
All our examples had the same two-level structure—observations
within clusters. More-complex multilevel structures are an active research
area; one recent paper describing imputation for multilevel models is
Goldstein et al. (2009).
References
- Allison, P. D. 2002. Missing Data. Thousand Oaks, CA: Sage.
- Goldstein, H., J. R. Carpenter, M. G. Kenward, and K. A. Levin. 2009. Multilevel models with multivariate mixed response types. Statistical Modelling 9: 173–197.
- Graham, J. W. 2009. Missing data analysis: Making it work in the real world. Annual Review of Psychology 60: 549–576.