
RE: st: St: How to handle missing observations in the factor-principal component analysis

From   "Simo Hansen" <>
To   <>
Subject   RE: st: St: How to handle missing observations in the factor-principal component analysis
Date   Wed, 19 Dec 2007 13:51:25 +0200

Dear Dr. Maarten,
Thank you for the eye-opening explanation and example. However, I need a
second variable to run the -ice- command. For example, I created a dummy
variable indicating whether a woman uses a computer at work or not. In order
to use -ice-, I need a second variable, which I don't think I have (or do
I?). In your example, I created many dummy variables to capture a woman's
knowledge level. So I am thinking that I am forced to replace missing values
with mean values. Am I right? I think I am missing something here. For
example, for the computer dummy variable (let's call it computer), what
would be the second variable that I can use?
ice computer ????, saving(temp, replace)
Your explanation raised another question for me. You said:
"This looks problematic, even without missing data."
Are there alternative ways to achieve the same purpose?

Thank you very much for your suggestion and explanation.
Best regards,
-----Original Message-----
[] On Behalf Of Maarten buis
Sent: Wednesday, 19 December 2007 13:03
Subject: Re: st: St: How to handle missing observations in the
factor-principal component analysis 

--- Simo Hansen <> wrote:
> I try to construct knowledge index for women in my data. I have some
> missing observation for the variables that are converted to dummy
> variables to conduct the factor analysis. Could anyone provide a help
> about how to handle those missing observations? 

This looks problematic, even without missing data. At the very bottom I
will suggest a solution to the missing data problem if you still wish
to proceed with this analysis. Consider the logic behind factor
analysis:

We imagine that there is one or more unobserved variables (f) that
influence the observed variables (x) in a linear way. Say we have three
observed variables (x1, x2, and x3) and one factor (f), and that both
the observed variables and the factor are standardized, so there is no
constant. So, we get the following system:

x1 = l1 f + e1
x2 = l2 f + e2
x3 = l3 f + e3  

We don't observe f, but if we assume that the errors across equations
are not correlated, then all correlation between x1, x2, and x3 is due to
the fact that they have f in common. We use this to reconstruct f.
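
A minimal sketch of this logic on simulated data: with standardized
variables and uncorrelated errors, the model above implies
corr(xi, xj) = li*lj. The loadings (.8, .6, .7), the seed, and the number
of observations are illustrative assumptions, not values from the
original question.

```stata
* One-factor model on simulated data (all numbers are illustrative)
clear
set seed 12345
set obs 10000

* the unobserved factor, standardized
gen f = invnormal(uniform())

* observed variables: loading times factor plus independent error,
* with error variances chosen so that each x has variance 1
gen x1 = .8*f + sqrt(1 - .8^2)*invnormal(uniform())
gen x2 = .6*f + sqrt(1 - .6^2)*invnormal(uniform())
gen x3 = .7*f + sqrt(1 - .7^2)*invnormal(uniform())

* implied correlations: .8*.6 = .48, .8*.7 = .56, .6*.7 = .42
corr x1 x2 x3

* reconstruct f from the correlations alone and compare with the truth
factor x1 x2 x3, factors(1)
predict fhat
corr f fhat
```

The sample correlations should land near the implied products, and the
predicted score should track f closely even though f was never used in
the -factor- command.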

The problem is that if any of the xs is a dummy then the assumption of
a linear effect of a variable on that x can fail. If you turn a
categorical or ordinal variable into dummies then you are adding
dependencies between your variables that have nothing to do with the
common factor but are still assigned to that factor.
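
A small illustration of those built-in dependencies, assuming the auto
data shipped with Stata stands in for your own data: the dummies made
from one categorical variable (rep78 here) are mutually exclusive, so
they are negatively correlated by construction, whatever the underlying
factor is.

```stata
* Dummies from a single categorical variable are correlated by construction
sysuse auto, clear

* rep78 takes the values 1 to 5; this creates rep1 ... rep5
tabulate rep78, generate(rep)

* every off-diagonal correlation is negative: if one dummy is 1,
* all the others must be 0
corr rep1-rep5
```

A factor analysis fed with rep1-rep5 would pick up this mechanical
pattern as if it were information about the latent variable.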

> The other question I have that there is a
> following command in SPSS:
> /Missing MeanSub.
> How can I write this command in Stata?

This looks like mean imputation to me. This is a very, very bad idea.
Remember that factor analysis uses the correlations between variables
to reconstruct the latent factor. With mean imputation you seriously
distort those correlations. Consider two variables: x1 and x2, where x1
has missing data, which are replaced by the mean. The consequences are
shown in the graphs below:

  |      ***         |      ***
  |    *****         |    *****
  |   *****          |   *****  
x1|  *****           |xx*****xxx 
  | *****            | *****
  |*****             |*****
  |***               |*** 
  ---------------    --------------- 
        x2                  x2

The xs in the right graph are the imputed values. They all lie on a
horizontal line at the mean of x1, so they seriously distort the
correlation. In particular, this leads to an underestimation of the
correlation.

The help file of -factor- also links to the -impute- command, which
does regression imputation. This too is a bad idea. It puts all the
missing values on the regression line, as is shown in the graphs below,
and thus overestimates the correlation.

  |      ***         |      *x*
  |    *****         |    **x**
  |   *****          |   **x**  
x1|  *****           |  **x** 
  | *****            | **x**
  |*****             |**x**
  |***               |*x* 
  ---------------    --------------- 
        x2                  x2

A better method is to use -ice-, which can be downloaded from -ssc-.
This will preserve the actual correlation, by adding the necessary noise
around the regression line:

  |      ***         |      x**
  |    *****         |    ***x*
  |   *****          |   x****  
x1|  *****           |  ****x  
  | *****            | **x**
  |*****             |***x*
  |***               |x** 
  ---------------    --------------- 
        x2                  x2

Notice that it is not necessary to let -ice- make multiple imputed
datasets if you are only interested in the point estimates for the
factor scores. The multiple imputations are only used for adjusting the
standard errors. So my suggestion for your missing data problem is: use
-ice- to create an imputed dataset, and use that dataset to do the
factor analysis. Remember to add -if _mj==1- to the -factor- command
(-ice- stores your original data on top, identified by the value 0 on
the variable _mj).
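
A sketch of that workflow, mirroring the -ice- syntax used in the
simulation below. The variable names know1-know3, the file name imputed,
and the score name kscore are hypothetical placeholders, and the options
accepted by -ice- may differ across versions, so check -help ice- for the
version you have installed.

```stata
* know1-know3 are hypothetical names for the items with missing values
* -ice- imputes each variable from the others in the list and saves
* the original data (_mj == 0) stacked on top of the imputed data
ice know1 know2 know3, saving(imputed, replace)

use imputed, clear

* use only the imputed copy of the data for the factor analysis
factor know1 know2 know3 if _mj == 1

* factor scores for the imputed copy
predict kscore if _mj == 1
```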

Hope this helps,

P.S. Below is a simulation showing what the different methods do to the
correlation:

*----------------- begin example -----------------------
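set more off
sysuse auto, clear

* correlation in the complete data, used as the benchmark
corr headroom trunk
global true = r(rho)
* -ice- saves its output here; change to a writable directory
cd "h:\temp"
capture program drop sim
program sim, rclass
	sysuse auto, clear
	* make headroom missing, less often when trunk is large
	replace headroom = . if uniform() < invlogit(-1 - .1* trunk) 

	* regression imputation
	impute headroom trunk, gen(headimp)
	corr headimp trunk
	return scalar imp = r(rho) - $true

	* mean imputation
	sum headroom, meanonly
	gen headmean = cond(missing(headroom),r(mean),headroom)
	corr headmean trunk
	return scalar mean = r(rho) - $true

	* imputation with -ice-; the imputed data have _mj == 1
	ice headroom trunk, saving(temp, replace)
	use temp, clear
	corr headroom trunk if _mj == 1
	return scalar ice = r(rho) - $true
end

* collect the deviations from the true correlation over 10000 replications
simulate imp=r(imp) mean=r(mean) ice=r(ice), reps(10000) : sim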

twoway kdensity imp || kdensity mean || kdensity ice,       ///
legend(order(1 "-impute-" 2 "mean" "imputation" 3 "-ice-")) ///
xtitle("deviation from true correlation")
*------------------ end example ------------------------
(For more on how to use examples I sent to the Statalist, see )

Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands

visiting address:
Buitenveldertselaan 3 (Metropolitan), room Z434

+31 20 5986715
