From: ymarchenko@stata.com (Yulia Marchenko, StataCorp LP)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Use of collapse (sum) in Multiple Imputation
Date: Wed, 12 Oct 2011 16:37:30 -0500
Alberto Zezza <azezza@hotmail.com> asks how to obtain household-level data
which can be analyzed using -mi- from multiply-imputed individual-level data:

> I have a dataset with both individual and family level variables.
> Individuals are uniquely identified by a variable pid, households by a
> variable hhid.
>
> I have missing data for some individuals in an individual level variable x
> which I would like to impute, before summing it up over individuals within a
> household to obtain a household level variable to use in further analysis.
>
> Is there a way to do that and carry on the analysis within the mi
> environment in the household level file?

Alberto then provides code where he uses -collapse- with -mi xeq- to obtain
such a dataset, but receives an error:

> I am currently doing the following, using an individual level data file:
>
>     mi set wide
>     mi register imputed x
>     mi register regular y z
>     mi impute regress x y z, add(20)
>     mi xeq: sort hhid; collapse (sum) x _*, by (hhid)
>
> but the command stops with an error when performing the collapse for m=M (20
> in my case) saying
>
>     variable _mi_id does not uniquely identify observations in the master data
>     r(459);
>
> The variable _mi_id is not 'visible' in my list of variables so I presume this
> is something Stata generates in the background to manage mi data.

Alberto receives this error because the -collapse- command should not be used
with -mi xeq-. Like -append-, -merge-, -reshape-, etc., -collapse-
substantially modifies the current data and thus should not be allowed with
-mi xeq-. We will modify -mi xeq- to issue an appropriate error message when
-collapse- is used.

Unlike such commands as -append- and -merge-, the -collapse- command does not
have an -mi- analog, e.g. -mi collapse-. However, we can do what Alberto wants
manually.

Before I proceed with an example, let's first agree on the definition of an
incomplete observation in the aggregated (household-level) data.
The distinction between complete and incomplete observations is important for
the -mi- command. So, we will consider an aggregate observation to be
incomplete if at least one of the individual observations used to obtain it
has a missing value.

Using Alberto's example, we can obtain household-level data as follows. After
the imputation step, we perform:

    // create household-level sums
    . mi convert flong, clear
    . qui mi xeq: by hhid, sort: egen x_sum = total(x)

    // create incomplete observations in the household-level variable x_sum
    . qui mi xeq 0: gen Mis_x = (x==.)
    . qui mi xeq 0: by hhid, sort: egen Mis_total = total(Mis_x)
    . qui mi xeq 0: replace x_sum = . if Mis_total>0

    // create household-level data
    . qui mi xeq: sort hhid pid; by hhid: drop if _n>1
    . qui mi xeq: drop pid x /*include any other individual-level variables*/

    // mark incomplete household-level observations
    . mi register imputed x_sum

Below I provide a detailed discussion of the code above.

First, it is important to note that many group-specific summaries of imputed
variables, such as the household-level sums of x in our example, are so-called
super-varying variables in individual-level datasets. Super-varying variables
are variables which may vary between imputations not only in the incomplete
observations but also in the complete observations; see -help mi glossary- for
more information. Super-varying variables can exist only in the -flong- (or
-flongsep-) style, so we should either start with this style or use
-mi convert- to convert to it before we create variables containing
group-specific summaries. To create household-specific sums of x, we can use
-mi xeq: egen-.

So, we start by converting from the previously set -wide- style to -flong-:

    . mi convert flong, clear

and then create a new variable x_sum containing household-specific sums of x:

    . qui mi xeq: by hhid, sort: egen x_sum = total(x)

Because x_sum is a super-varying variable, it should not be registered in the
individual-level data. Alberto will need to manually create new
household-level variables for any other individual-level variables of
interest, which can be done in a loop.

Next, in the original data (m=0), we set x_sum to missing in every household
that has at least one missing value of x:

    . qui mi xeq 0: gen Mis_x = (x==.)
    . qui mi xeq 0: by hhid, sort: egen Mis_total = total(Mis_x)
    . qui mi xeq 0: replace x_sum = . if Mis_total>0

Once all aggregate variables are created, we can drop all individual-level
observations except the first one in each household:

    . qui mi xeq: sort hhid pid; by hhid: drop if _n>1

We can now drop all individual-level variables:

    . qui mi xeq: drop pid x /*include any other individual-level variables*/

Finally, we register x_sum as imputed to mark incomplete household-level
observations:

    . mi register imputed x_sum

The resulting dataset now corresponds to -mi- household-level data.

As a side note, Alberto should consider taking into account the clustered
nature of his data during imputation of x. The following FAQ provides some
guidelines:

    http://www.stata.com/support/faqs/stat/impute_cluster.html

If Alberto has any questions, he should contact tech-support@stata.com and
they will be happy to help him further.

-- Yulia
ymarchenko@stata.com
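
The reply above notes that household-level versions of any other individual-level variables can be created in a loop. A minimal, untested sketch of such a loop follows. It reuses only the commands shown in the reply and assumes the data have already been converted to the -flong- style. The variable w, the local macro names, and the _sum, _mis and _nmis suffixes are hypothetical and stand in for whatever names the real dataset uses. The sketch also assumes every variable in the list was imputed. It uses missing() in place of (x==.) so that extended missing values (.a to .z) are flagged as well.

```stata
* Assumption: data are already mi set, imputed, and in the flong style.
* Assumption: x and w are imputed individual-level variables (w is hypothetical).
local indvars x w
local sums
local helpers

foreach v of local indvars {
    * household-level sum, computed separately within each m
    quietly mi xeq: by hhid, sort: egen `v'_sum = total(`v')

    * in the original data (m=0), flag households with any missing value
    quietly mi xeq 0: gen `v'_mis = missing(`v')
    quietly mi xeq 0: by hhid, sort: egen `v'_nmis = total(`v'_mis)
    quietly mi xeq 0: replace `v'_sum = . if `v'_nmis > 0

    * collect names for the later drop and register steps
    local sums `sums' `v'_sum
    local helpers `helpers' `v'_mis `v'_nmis
}

* keep one observation per household
quietly mi xeq: sort hhid pid; by hhid: drop if _n > 1

* drop individual-level variables and the helper flags
quietly mi xeq: drop pid `indvars' `helpers'

* mark incomplete household-level observations
mi register imputed `sums'

* inspect the resulting mi data
mi describe
```

After running something along these lines, -mi describe- should list each household-level sum as an imputed variable. If the reported counts of complete and incomplete observations do not match expectations, the missing-value flags created in m=0 are the first thing to check.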