Re: st: different approaches to use only observations that have nonmissing
Christopher W. Ryan <email@example.com> asks how the -svy- commands handle
missing values when a subpopulation is identified:
> Using Stata 8 on Win98.
> I'm trying to carry out an analysis of the Health Survey for England 2002
> data. I'm primarily interested in the hyperactivity variable among
> children, from the Strengths and Difficulties Questionnaire. That
> variable is called sdqhyper.
> My subpopulation of interest is kids ages 3-10. Adults of course all
> have missing values on sdqhyper, coded in HSE2002 as some negative
> integer (different ones for different types of missing.)
> Is it better to recode the missings as Stata's missing value (.) and use
> age between 3 and 10 as my subpopulation; or is it better to create a
> subpopulation of kids between 3 and 10 who also have no missing values
> on sdqhyper? These approaches seem to give different results. Here's a
> short do file. Running with -nostop- I think illustrates my dilemma:
> <Stata code omitted>
> I guess my underlying question is, how does Stata handle missing values,
> versus subpopulations, in a -svy- command?
Stata's -svy- estimation commands drop observations that contain missing
values in variables relevant to estimation. Thus if -sdqhyper- contains
missing values, then -svyprop- in Stata 8 (-svy: proportion- in Stata 9) will
drop those observations from the estimation sample. In Chris' example, this
results in a stratum with a single sampling unit.
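To see how this can happen, here is a minimal sketch on made-up data. The
variables -stratum- and -psu- and all of the values are hypothetical, not taken
from HSE2002, and -egen-'s -total()- function assumes Stata 9 or later (it was
called -sum()- in Stata 8). The sketch recodes the negative codes to missing and
then counts how many sampling units in each stratum still have a nonmissing
-sdqhyper-:

```stata
clear
input stratum psu age sdqhyper
1 1 5  3
1 2 7 -1
2 1 4  2
2 2 9  6
end

* treat every negative code as Stata's missing value (.)
replace sdqhyper = . if sdqhyper < 0

* tag one observation per (stratum, psu) among observations with a
* nonmissing sdqhyper; excluded observations get tag == 0
egen tag = tag(stratum psu) if sdqhyper < .

* number of sampling units left in each stratum once missings are dropped
egen npsu = total(tag), by(stratum)

list stratum psu sdqhyper npsu
```

In this toy dataset, stratum 1 is left with a single sampling unit once the
observation with a missing -sdqhyper- is dropped, which is the situation
described above.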
It seems that the appropriate method for handling this is to generate two
indicator variables. One will identify the subpopulation, the other will
identify valid values within the subpopulation. Given the example Stata 8
code Chris provided, we would suggest the following:
* the -myage- variable identifying the subpopulation already exists, so we are
* left to identify the valid observations within the subpopulation
gen valid = (myage == 1) & (sdqhyper >= 1)
* use -if valid- to properly identify the estimation sample for this analysis
svyprop sdqhyper if valid, subpop(myage)
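One caution about the -gen valid- line, sketched here on made-up data: Stata
treats missing values as larger than any number, so -sdqhyper >= 1- is true when
-sdqhyper- is missing (.). The line above excludes the negative codes, but it
would not exclude observations that have already been recoded to missing. In
that case a variant such as the hypothetical -valid2- below may be needed:

```stata
clear
input myage sdqhyper
1  3
1 -1
1  .
0 -1
end

* as suggested above: excludes the negative codes, but . >= 1 is true
gen valid = (myage == 1) & (sdqhyper >= 1)

* hypothetical variant that also excludes Stata's missing value
gen valid2 = (myage == 1) & (sdqhyper >= 1) & (sdqhyper < .)

list
```

Only the third observation differs between the two indicators: -valid- keeps it
and -valid2- does not.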
Notice that the -svyprop- call above is slightly different from the second
method Chris proposed. In that method, missing values within the subpopulation
were treated as not being in the subpopulation; with -if valid-, however,
-svyprop- drops the 'missing' observations within the subpopulation from the
estimation sample instead.
The result could be different variance estimates and design degrees of freedom.