Statalist The Stata Listserver



Re: st: how to handle missing observations in a regression model


From   Maarten buis <maartenbuis@yahoo.co.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: how to handle missing observations in a regression model
Date   Tue, 5 Sep 2006 11:32:26 +0100 (BST)

--- Simo Hansen <simohansen@gmail.com> wrote:
> I am using mother's years of schooling and father's years of schooling
> as explanatory variables in my regression model. I also create
> dummy indicators for whether mother's and father's years of schooling
> are missing. <snip> When I run the following regression:
> reg childedyrs dadeduc moteduc misdaded mismoted, 
> Stata drops two dummy indicators for whether parents' schooling is
> missing.

Stata (and most other stats packages) ignores observations with missing
values on either the dependent or independent variables. So Stata sees
the variables misdaded and mismoted only for observations in which both
dadeduc and moteduc are observed, and in those observations the dummies
only take the value 0. Each dummy is thus a constant and is dropped.
The conventional way of dealing with this is to replace dadeduc and
moteduc with their mean values where they are missing. Supposedly the
dummies now measure how much the child's education deviates from the
mean if the child has missing values on mother's and father's education
respectively.
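
As a rough sketch, the mean-substitution approach described above could
be set up as follows. It assumes the dummies misdaded and mismoted
already exist as in the original question; the variables dadeduc_f and
moteduc_f are hypothetical names for the filled-in copies.

```stata
* Sketch of the mean-substitution (missing-indicator) approach.
* dadeduc_f and moteduc_f are hypothetical names, not from the original post.

* fill in father's education with the observed mean where it is missing
summarize dadeduc, meanonly
generate dadeduc_f = cond(missing(dadeduc), r(mean), dadeduc)

* same for mother's education
summarize moteduc, meanonly
generate moteduc_f = cond(missing(moteduc), r(mean), moteduc)

* the dummies now vary within the estimation sample, so they are kept
regress childedyrs dadeduc_f moteduc_f misdaded mismoted
```

The originals are left untouched so the filled-in versions can be
compared with them later.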

However, this approach leads to biased estimates and your problem is
likely to be a worst case scenario for this approach. For simplicity,
assume that mother's education is completely observed, so the
regression equation is:
childedyrs = b0 + b1*dadeduc + b2*moteduc + b3*misdaded 

if father's education is observed the regression becomes:
childedyrs = b0 + b1*dadeduc + b2*moteduc + b3*0
childedyrs = b0 + b1*dadeduc + b2*moteduc 

if father's education is missing the regression equation becomes:
childedyrs = b0 + b1*dadeduc + b2*moteduc + b3*1
But notice that dadeduc is now a constant: for these cases it was
replaced by the mean value, so we have a constant equal to
b0 + b1*mean(dadeduc) + b3. Call this constant b0'. So we can rewrite
the regression equation as:
childedyrs = b0' + b2*moteduc 

So the effect of mother's education is the effect controlled for
father's education if father's education is observed, and the effect
not controlled for father's education if father's education is not
observed. The parameter you will find is some weighted average of these
two effects. The ``uncontrolled'' effect gets more weight as the
proportion of missing values increases. The ``controlled'' and
``uncontrolled'' effects are more different if father's and mother's
education are more correlated. In my experience the proportion of
missing values in father's and mother's education tends to be pretty
high, and the correlation of levels of education between partners is
amongst the highest nontrivial correlations produced by social
processes. So your problem is a worst case scenario for this method.
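
A small simulation can illustrate the argument above. This is only a
sketch: the sample size, coefficients, correlation between the parents'
education, and the 40% missingness rate are all made-up assumptions, and
dadeduc_f is a hypothetical name for the mean-filled variable.

```stata
* Simulated illustration; all numbers below are assumptions.
clear
set seed 12345
set obs 10000

* correlated parental education; true effect of each parent is 0.4
generate dadeduc    = rnormal(12, 3)
generate moteduc    = 12 + 0.7*(dadeduc - 12) + rnormal(0, 2)
generate childedyrs = 2 + 0.4*dadeduc + 0.4*moteduc + rnormal(0, 2)

* benchmark: regression on the complete data
regress childedyrs dadeduc moteduc

* make father's education missing completely at random for about 40%
generate byte misdaded = runiform() < 0.4
summarize dadeduc if !misdaded, meanonly
generate dadeduc_f = cond(misdaded, r(mean), dadeduc)

* mean substitution plus missing-value dummy
regress childedyrs dadeduc_f moteduc misdaded
```

In the second regression the coefficient on moteduc should tend to lie
above the 0.4 used to generate the data, because for the cases with
missing father's education it also picks up part of the father's effect.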

To deal with the missing values you could do multiple imputation with
-ice-. Another option is to use -hotdeck-.
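
As a hedged sketch of the multiple imputation route: the example below
does not use -ice- but Stata's built-in -mi- suite with chained
equations, which is only available in Stata versions later than this
post. The number of imputations and the seed are arbitrary assumptions,
and a linear imputation model for years of schooling is a
simplification.

```stata
* Sketch using official -mi- commands instead of -ice-.
* add(20) and rseed(12345) are arbitrary choices for illustration.
mi set wide
mi register imputed dadeduc moteduc

* impute both parents' education, conditioning on the child's education
mi impute chained (regress) dadeduc moteduc = childedyrs, ///
    add(20) rseed(12345)

* fit the model in each imputed dataset and combine the results
mi estimate: regress childedyrs dadeduc moteduc
```

The dependent variable is included in the imputation model on purpose;
leaving it out would tend to pull the estimated effects towards zero.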

HTH,
Maarten






-----------------------------------------
Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands

visiting address:
Buitenveldertselaan 3 (Metropolitan), room Z434

+31 20 5986715

http://home.fsw.vu.nl/m.buis/
-----------------------------------------


	
	
		


