# Re: st: St: How to handle missing observations in the factor-principal component analysis

From: Maarten Buis
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: St: How to handle missing observations in the factor-principal component analysis
Date: Wed, 19 Dec 2007 11:02:31 +0000 (GMT)

```--- Simo Hansen <simohansen@gmail.com> wrote:
> I try to construct knowledge index for women in my data. I have some
> missing observation for the variables that are converted to dummy
> variables to conduct the factor analysis. Could anyone provide a help
> about how to handle those missing observations?

This looks problematic, even without missing data. At the very bottom I
will suggest a solution to the missing data problem if you still wish
to proceed with this analysis. Consider the logic behind factor
analysis:

We imagine that there are one or more unobserved variables (f) that
influence the observed variables (x) in a linear way. Say we have three
observed variables (x1, x2, and x3) and one factor (f), and that both
the observed variables and the factor are standardized, so there is no
constant. So, we get the following system:

x1 = l1 f + e1
x2 = l2 f + e2
x3 = l3 f + e3

We don't observe f, but if we assume that the errors across equations
are not correlated, then all correlation between x1, x2, and x3 is due to
the fact that they have f in common. We use this to reconstruct f.
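
A minimal sketch of this logic, assuming made-up loadings of .8, .7 and
.6, an arbitrary seed and sample size, and a Stata version that has
-rnormal()-. With standardized variables and uncorrelated errors, the
correlation between xi and xj implied by the system above is li*lj:

*----------------- begin example -----------------------
clear
set seed 12345
set obs 10000
* the unobserved factor
gen f  = rnormal()
* error variances are chosen so that each x has variance 1
gen x1 = .8*f + sqrt(1 - .8^2)*rnormal()
gen x2 = .7*f + sqrt(1 - .7^2)*rnormal()
gen x3 = .6*f + sqrt(1 - .6^2)*rnormal()
* implied correlations: .8*.7 = .56, .8*.6 = .48, .7*.6 = .42
corr x1 x2 x3
* a one-factor model should roughly recover the loadings (up to sign)
factor x1 x2 x3, ml factors(1)
*------------------ end example ------------------------

The sample correlations will only be approximately equal to the implied
ones, because of sampling variation.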

The problem is that if any of the xs is a dummy, then the assumption of
a linear effect of the factor on that x can fail. If you turn a
categorical or ordinal variable into dummies, then you are adding
dependencies between your variables that have nothing to do with the
common factor but are still assigned to that factor.
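
To see the kind of dependency this adds, here is a small sketch with a
hypothetical three-category variable (cat) that is pure noise, so there
is no common factor at all. The variable names, seed, and sample size
are made up, and -runiform()- assumes a Stata version that has it. The
dummies exclude one another, so they are negatively correlated by
construction (about -.5 for three equally likely categories):

*----------------- begin example -----------------------
clear
set seed 12345
set obs 10000
* a random category 1, 2 or 3, unrelated to anything else
gen cat = ceil(3*runiform())
* turn it into the dummies d1, d2 and d3
tab cat, gen(d)
* the dummies are correlated although no factor generated them
corr d1 d2 d3
*------------------ end example ------------------------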

> The other question I have that there is a
> following command in SPSS:
> /Missing MeanSub.
> How can I write this command in Stata?

This looks like mean imputation to me. This is a very, very bad idea.
Remember that factor analysis uses the correlations between variables
to reconstruct the latent factor. With mean imputation you seriously
distort those correlations. Consider two variables: x1 and x2, where x1
has missing data, which are replaced by the mean. The consequences are
shown in the graphs below:

  |      ***         |      ***
  |    *****         |    *****
  |   *****          |   *****
x1|  *****           |xx*****xxx
  | *****            | *****
  |*****             |*****
  |***               |***
  ---------------    ---------------
        x2                  x2

The xs in the right graph are the imputed values; it is clear that they
seriously distort the correlation. In particular, this leads to an
underestimation of the correlation.

The help file of -factor- also links to the -impute- command, which
does regression imputation. This too is a bad idea. It puts all the
missing values on the regression line, as is shown in the graphs below,
and thus overestimates the correlation.

  |      ***         |      *x*
  |    *****         |    **x**
  |   *****          |   **x**
x1|  *****           |  **x**
  | *****            | **x**
  |*****             |**x**
  |***               |*x*
  ---------------    ---------------
        x2                  x2

A better method is to use -ice-, which can be downloaded from -ssc-.
This will preserve the actual correlation, by adding the necessary noise
around the regression line:

  |      ***         |      x**
  |    *****         |    ***x*
  |   *****          |   x****
x1|  *****           |  ****x
  | *****            | **x**
  |*****             |***x*
  |***               |x**
  ---------------    ---------------
        x2                  x2

Notice that it is not necessary to let -ice- make multiple imputed
datasets if you are only interested in the point estimates for the
factor scores. The multiple imputations are only used for adjusting the
standard errors. So my suggestion to your missing data problem is: use
-ice- to create an imputed dataset, and use that dataset to do the
factor analysis. Remember to add -if _mj==1- to the -factor- command
(-ice- stores your original data on top, which is identified by the
value 0 on the variable _mj).
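
A sketch of that workflow on the auto data, as an illustration only: the
choice of variables, the seed, the 20% missingness, and the file name
imputed.dta are all made up, and the -saving()- and -m()- options assume
a recent version of -ice- (the syntax has changed between versions, so
check -help ice-):

*----------------- begin example -----------------------
* ssc install ice      // once, if -ice- is not yet installed
sysuse auto, clear
set seed 12345
* create some missing values to impute
replace headroom = . if runiform() < .2
* one imputed dataset, stored below the original data in imputed.dta
ice headroom trunk weight length, saving(imputed, replace) m(1)
use imputed, clear
* _mj == 0 is the original data, _mj == 1 the imputed copy
factor headroom trunk weight length if _mj == 1, factors(1)
predict f1 if _mj == 1
*------------------ end example ------------------------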

Hope this helps,
Maarten

Ps. Below is a simulation showing what the different methods do to the
correlations:

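*----------------- begin example -----------------------
set more off
sysuse auto, clear
* correlation in the complete data, used as the benchmark
* (this -corr- line is absent from the archived post; reconstructed)
corr headroom trunk
global true = r(rho)
cd "h:\temp"   // working directory for temp.dta; change to suit
capture program drop sim
program sim, rclass
    sysuse auto, clear
    * make headroom missing with a probability that depends on trunk
    replace headroom = . if uniform() < invlogit(-1 - .1* trunk)

    * NOTE: the imputation steps below are absent from the archived
    * post and are reconstructed from the surrounding text; the exact
    * -ice- syntax depends on the installed version

    * regression imputation with -impute-
    impute headroom trunk, gen(imp)
    corr imp trunk
    return scalar imp = r(rho) - $true

    * mean imputation
    sum headroom, meanonly
    gen mean = cond(missing(headroom), r(mean), headroom)
    corr mean trunk
    return scalar mean = r(rho) - $true

    * a single imputation with -ice-, saved in temp.dta
    ice headroom trunk, saving(temp, replace) m(1)
    use temp, clear
    corr headroom trunk if _mj == 1
    return scalar ice = r(rho) - $true
end
simulate imp=r(imp) mean=r(mean) ice=r(ice), reps(10000) : sim

twoway kdensity imp || kdensity mean || kdensity ice,       ///
legend(order(1 "-impute-" 2 "mean" "imputation" 3 "-ice-")) ///
xtitle("deviation from true correlation")
*------------------ end example ------------------------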
(For more on how to use examples I sent to the Statalist, see
http://home.fsw.vu.nl/m.buis/stata/exampleFAQ.html )

-----------------------------------------
Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands

Buitenveldertselaan 3 (Metropolitan), room Z434

+31 20 5986715

http://home.fsw.vu.nl/m.buis/
-----------------------------------------
