# Re: st: Factor analysis(?) question - missing data

|  |  |
|---|---|
| From | Phil Schumm |
| To | statalist@hsphsun2.harvard.edu |
| Subject | Re: st: Factor analysis(?) question - missing data |
| Date | Tue, 22 Apr 2008 13:36:00 -0500 |

On Apr 22, 2008, at 1:06 PM, Glenn Hoetker wrote:

> This is perhaps more of a statistical question than a Stata question. My situation is this. I have a large dataset in which there are 5-6 indicators each for a bunch of latent variables. Let me take as an example having 5 measures for innovative output, x1-x5. The problem is that very few observations have all 5 measures; some are missing x1, some x2, etc. Almost every observation has at least 3 measures and most have 4.
>
> Is there any way to optimally combine these indicators to measure the underlying construct of innovative output that would use all available measures for a given observation, i.e., x1-x4 for one observation, [x1-x3,x5] for another, etc.? If I thought these were equally weighted, I could just average over the available variables in each, setting aside issues of measurement error. However, I'm not convinced they are equally weighted and would like to do this in a more rigorous fashion.

How you approach this will depend critically on whether the missing data are missing at random (MAR), or, more precisely, on whether you are willing to assume that this is so. It is often difficult, if not impossible, to investigate this rigorously.

If you are willing to assume MAR, then you have at least 3 options. You can fit a factor analytic (or other similar) model directly using an algorithm that can accommodate missing data (e.g., the EM algorithm, or, better yet, the ECME algorithm; see, for example, Liu and Rubin, Statistica Sinica 8 (1998), 729-747). I once programmed this (EM) in Stata to handle multiple regression with missing data -- perhaps others have done more. Second, you can fit the model using -gllamm-, which will accommodate missing data under the MAR assumption. And finally, you could use multiple imputation, as implemented for example in Royston's excellent -ice- package (try -ssc describe ice-). In all cases, you could then use empirical Bayes estimates of the latent factors in subsequent analyses, or go on to fit a full structural model.
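To make the first option concrete, here is a minimal sketch of EM for a one-factor model with missing indicators, written in plain Python rather than Stata. It is not the Stata program mentioned above, nor what -gllamm- does internally. It assumes a single factor with mean 0 and variance 1, normal errors that are independent across indicators, and MAR. The function names (`factor_score`, `fit_one_factor`, `simulate`) and the simulated loadings are hypothetical, chosen only for illustration. Because the indicators are independent given the factor, a missing indicator simply drops out of each sum, so every observation contributes whatever measures it has.

```python
import math
import random


def factor_score(row, mu, lam, psi):
    """Posterior mean and variance of the factor given the observed entries.

    Missing indicators are coded as None and are skipped.
    """
    precision = 1.0  # prior: f ~ N(0, 1)
    weighted = 0.0
    for x, m, l, p in zip(row, mu, lam, psi):
        if x is None:
            continue
        precision += l * l / p
        weighted += l * (x - m) / p
    var = 1.0 / precision
    return var * weighted, var


def fit_one_factor(data, n_iter=200):
    """EM estimates of intercepts, loadings and error variances."""
    k = len(data[0])
    cols = [[r[j] for r in data if r[j] is not None] for j in range(k)]
    mu = [sum(c) / len(c) for c in cols]
    var = [sum((x - m) ** 2 for x in c) / len(c) for c, m in zip(cols, mu)]
    # Starting values: half of each indicator's variance is attributed to the factor.
    lam = [math.sqrt(v / 2) for v in var]
    psi = [max(v / 2, 1e-6) for v in var]
    for _ in range(n_iter):
        # E-step: posterior mean and variance of the factor for every row.
        scores = [factor_score(r, mu, lam, psi) for r in data]
        # M-step: regress each indicator on the factor, using only rows where it is observed.
        for j in range(k):
            n = s_f = s_ff = s_x = s_xf = 0.0
            for r, (m, v) in zip(data, scores):
                if r[j] is None:
                    continue
                n += 1
                s_f += m
                s_ff += m * m + v
                s_x += r[j]
                s_xf += r[j] * m
            lam[j] = (s_xf - s_x * s_f / n) / (s_ff - s_f * s_f / n)
            mu[j] = (s_x - lam[j] * s_f) / n
            resid = sum(
                (r[j] - mu[j] - lam[j] * m) ** 2 + lam[j] ** 2 * v
                for r, (m, v) in zip(data, scores)
                if r[j] is not None
            )
            psi[j] = max(resid / n, 1e-6)  # floor guards against a zero variance
    return mu, lam, psi


def simulate(n=500, p_missing=0.25, seed=1):
    """Hypothetical data: five indicators of one factor, values deleted at random."""
    rng = random.Random(seed)
    true_lam = [1.0, 0.8, 0.6, 1.2, 0.9]
    factors, data = [], []
    for _ in range(n):
        f = rng.gauss(0, 1)
        row = [l * f + rng.gauss(0, 0.7) for l in true_lam]
        data.append([None if rng.random() < p_missing else x for x in row])
        factors.append(f)
    return factors, data


if __name__ == "__main__":
    _, data = simulate()
    mu, lam, psi = fit_one_factor(data)
    print("estimated loadings:", [round(l, 2) for l in lam])
    print("score for first row:", factor_score(data[0], mu, lam, psi))
```

The value returned by `factor_score` is the empirical Bayes estimate of the factor for that observation. Indicators with larger loadings relative to their error variance get more weight, which is the sense in which the measures are not equally weighted.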

I'm sure others will have more to say...

-- Phil
