# Re: st: how to handle missing observations in a regression model

From: Maarten Buis; To: statalist@hsphsun2.harvard.edu; Subject: Re: st: how to handle missing observations in a regression model; Date: Tue, 5 Sep 2006 11:32:26 +0100 (BST)

```
--- Simo Hansen <simohansen@gmail.com> wrote:
> I am using mother's years of schooling and father's years of schooling
> as explanatory variables in my regression model. I also create
> dummy indicators for whether mother's and father's years of schooling
> are missing. <snip> When I run the following regression:
> Stata drops two dummy indicators for whether parents' schooling is
> missing.

Stata (and most other stats packages) ignores observations with missing
values on either the dependent or independent variables. So Stata sees
the variables misdaded and mismoted only for observations where both
parents' education is observed, and in that case the dummies only take
the value 0. Each dummy is thus a constant and is dropped. The
conventional way of dealing with this is to replace daded and moded
with the mean value where they are missing.
Supposedly the dummies now measure how much the child's education
deviates from the mean if the child has missing values on mother's and
father's education respectively.

Your situation is likely to be a worst case scenario for this approach.
For simplicity, assume that mother's education is completely observed,
so the regression equation is:
childedyrs = b0 + b1*dadeduc + b2*moteduc + b3*misdaded

If father's education is observed the regression becomes:
childedyrs = b0 + b1*dadeduc + b2*moteduc + b3*0
which simplifies to:
childedyrs = b0 + b1*dadeduc + b2*moteduc

If father's education is missing the regression equation becomes:
childedyrs = b0 + b1*dadeduc + b2*moteduc + b3*1
But notice that dadeduc is now a constant: for these cases it was
replaced by the mean value, so we have a constant equal to
b0 + b1*mean(dadeduc) + b3. Call this constant b0'. So we can rewrite
the regression equation as:
childedyrs = b0' + b2*moteduc

So the effect of mother's education is the effect controlled for
father's education if father's education is observed, and the effect
not controlled for father's education if father's education is not
observed. The parameter you will find is some weighted average of these
two effects. The ``uncontrolled'' effect gets more weight as the
proportion of missing values increases. The ``controlled'' and
``uncontrolled'' effects differ more the more strongly father's and
mother's education are correlated. In my experience the proportion of
missing values in father's and mother's education tends to be pretty
high, and the correlation of levels of education between partners is
amongst the highest nontrivial correlations produced by social
processes. So your problem is a worst case scenario for this method.

To deal with the missing values you could do multiple imputation with
-ice-. Another option is to use -hotdeck-.

HTH,
Maarten

-----------------------------------------
Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands

Buitenveldertselaan 3 (Metropolitan), room Z434

+31 20 5986715

http://home.fsw.vu.nl/m.buis/
-----------------------------------------

```
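The argument in the post, that mean imputation plus a missing-value dummy yields a mix of the "controlled" and "uncontrolled" effects of mother's education, can be checked with a small simulation. The sketch below is an illustration and not part of the original message: it is written in Python rather than Stata, and the coefficients (0.5 for father, 0.3 for mother), the correlation of 0.6 between parents' education, the 40% share of missing values, and the assumption that father's education is missing completely at random are all made up for the example.

```python
import random

def ols(rows, y):
    """Least-squares coefficients (constant first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def simulate(n=20000, rho=0.6, p_missing=0.4, seed=1):
    """Hypothetical data: (child, dad, mom, dad_is_missing) tuples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        mom = rng.gauss(0, 1)
        dad = rho * mom + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        child = 1.0 + 0.5 * dad + 0.3 * mom + rng.gauss(0, 1)
        data.append((child, dad, mom, rng.random() < p_missing))
    return data

def mother_effects(data):
    """Coefficient on mother's education under three specifications."""
    observed = [d for d in data if not d[3]]
    missing = [d for d in data if d[3]]
    dad_mean = sum(d[1] for d in observed) / len(observed)
    # Father observed: effect of mother controlled for father
    controlled = ols([(d[1], d[2]) for d in observed],
                     [d[0] for d in observed])[2]
    # Father missing: effect of mother not controlled for father
    uncontrolled = ols([(d[2],) for d in missing],
                       [d[0] for d in missing])[1]
    # Mean imputation of father's education plus a missing-value dummy
    imputed = ols([(dad_mean if d[3] else d[1], d[2], 1.0 if d[3] else 0.0)
                   for d in data], [d[0] for d in data])[2]
    return controlled, imputed, uncontrolled

controlled, imputed, uncontrolled = mother_effects(simulate())
print("controlled:  ", round(controlled, 3))
print("dummy method:", round(imputed, 3))
print("uncontrolled:", round(uncontrolled, 3))
```

Under these assumptions the coefficient from the dummy method falls between the other two, and raising `p_missing` or `rho` in the call to `simulate` pulls it further away from the controlled effect.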