
Re: st: how to handle missing observations in a regression model

From	Joseph Coveney <[email protected]>
To	Statalist <[email protected]>
Subject	Re: st: how to handle missing observations in a regression model
Date	Tue, 05 Sep 2006 19:56:17 +0900

Simo Hansen wrote:

I am using mother's years of schooling and father's years of schooling as
explanatory variables in my regression model. I also create dummy
indicators for whether mother's and father's years of schooling are missing,
respectively:
gen misdaded=dadeduc==.
gen mismoted=moteduc==.
When I run the following regression:
reg childedyrs dadeduc moteduc misdaded mismoted,
Stata drops the two dummy indicators for whether parents' schooling is missing.
Do you have any suggestion on how I can properly control for whether
mother's and father's years of schooling are missing in my regression model?

--------------------------------------------------------------------------------

As I recall (caution!), there was a technical report from the BMD/BMDP
organization that described this approach of using indicator variables to
flag missing predictors.  You substitute an arbitrary constant (say, zero)
for the missing values and flag the missing value with a dummy variable.
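
For concreteness, here is a minimal sketch of that recipe, assuming your
variables are named as in your -regress- command.  The names dadeduc0 and
moteduc0 are made up for illustration, and the first two lines assume the
indicators do not exist yet.  It also suggests why your indicators were
dropped: -regress- discards any observation with a missing predictor, so
unless the missing values are filled in first, both indicators are zero in
every observation that is left and are collinear with the constant.

    * flag the missing values
    generate byte misdaded = missing(dadeduc)
    generate byte mismoted = missing(moteduc)
    * copies of the predictors with zero substituted for missing
    generate dadeduc0 = cond(misdaded, 0, dadeduc)
    generate moteduc0 = cond(mismoted, 0, moteduc)
    regress childedyrs dadeduc0 moteduc0 misdaded mismoted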

This approach came up on Statalist a while ago, too.  The upshot from the
reply postings was not to do this.

You can explore the behavior of this approach using -simulate- with a
data-generating process that mimics what you expect prevails in your study.
(This includes the mechanism of missingness.)  A rudimentary example of this
is shown below.  In it, about 5% of the values of each predictor are missing
completely at random.  The results indicate that this approach, compared to
plain listwise deletion, has reduced power and an overly conservative Type I
error rate, and (under the alternative hypothesis) gives biased estimates.
If the standard deviation of the estimates is any indication, then there
could be some problem with the consistency of the estimator, too.  Other than
these picayune annoyances, it seems okay, though.
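
To mimic a different mechanism of missingness, the two lines in the simem
program that set values to missing could be swapped for something along the
following lines.  This is only a sketch that I have not run: the cutoff of 12
years and the 10% and 2% rates are made-up numbers, chosen so that father's
schooling is more often missing when the child has less schooling.

    * hypothetical variant: missingness in dadeduc depends on childedyrs
    replace dadeduc = . if uniform() < cond(childedyrs < 12, 0.10, 0.02)
    * mother's schooling stays about 5% missing, completely at random
    replace moteduc = . if uniform() > 0.95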

As an alternative, you might want to try some recognized method of imputing
the missing values.  See -hotdeck- or -ice-, for example.

Joseph Coveney

clear
set more off
set seed `=date("2006-09-04", "ymd")'
*
capture program drop simem
program define simem, rclass
    syntax , [DAD_contribution(real 0.5)]
    tempname mom dad con
    * parents' schooling: integers from 6 to 19 years, uniformly distributed
    replace dadeduc = 6 + floor(14 * uniform())
    replace moteduc = 6 + floor(14 * uniform())
    local mom_contribution = 1.0 - `dad_contribution'
    replace childedyrs = min(20, ///
      round(`dad_contribution' * dadeduc + ///
      `mom_contribution' * moteduc + 1 + ///
      invnorm(uniform()), 1))
    regress childedyrs moteduc dadeduc
    scalar `dad' = _b[dadeduc]
    scalar `mom' = _b[moteduc]
    scalar `con' = _b[_cons]
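    * complete data: is equality of the two contributions rejected at 5%?
    test moteduc = dadeduc
    return scalar complete = ( r(p) < 0.05 )
    * set about 5% of each predictor to missing, completely at random
    replace dadeduc = . if uniform() > 0.95
    replace moteduc = . if uniform() > 0.95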
    regress childedyrs moteduc dadeduc
    return scalar dad_listwise = scalar(`dad') - _b[dadeduc]
    return scalar mom_listwise = scalar(`mom') - _b[moteduc]
    return scalar con_listwise = scalar(`con') - _b[_cons]
    test moteduc = dadeduc
    return scalar listwise = ( r(p) < 0.05 )
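    * indicator approach: flag missing values, then replace them with zero
    * (dad and mot are abbreviations that expand to dadeduc and moteduc)
    foreach var of varlist dad mot {
        replace mis`var' = mi(`var')
        replace `var' = 0 if mi(`var')
    }
    regress childedyrs moteduc mismoteduc dadeduc misdadeduc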
    return scalar dad_bmd = scalar(`dad') - _b[dadeduc]
    return scalar mom_bmd = scalar(`mom') - _b[moteduc]
    return scalar con_bmd = scalar(`con') - _b[_cons]
    test moteduc = dadeduc
    return scalar bmd = ( r(p) < 0.05 )
    scalar drop `mom' `dad' `con'
end
*
foreach contribution in 0.5 0.525 0.55 {
    clear
    quietly {
        set obs 200
        generate byte dadeduc = .
        generate byte moteduc = .
        generate byte childedyrs = .
        generate byte misdadeduc = .
        generate byte mismoteduc = .
    }
    simulate complete = r(complete) listwise = r(listwise) ///
      bmd = r(bmd) dad_listwise = r(dad_listwise) ///
      mom_listwise = r(mom_listwise) ///
      con_listwise = r(con_listwise) ///
      dad_bmd = r(dad_bmd) mom_bmd = r(mom_bmd) ///
      con_bmd = r(con_bmd), reps(3000) nodots nolegend: ///
      simem , dad(`contribution')
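    * means of the 0/1 indicators are rejection rates for the test of equal
    * contributions: Type I error rate when dad() is 0.5, power otherwise
    summarize complete listwise bmd
    * complete-data coefficient estimates minus the missing-data estimates
    summarize dad_listwise dad_bmd mom_listwise mom_bmd ///
      con_listwise con_bmd, separator(2)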
}
exit


