Re: st: svy subpop option and e(sample)

From	Stas Kolenikov
Subject	Re: st: svy subpop option and e(sample)
Date	Wed, 25 May 2011 11:35:15 -0500

On Wed, May 25, 2011 at 10:10 AM, Richard Williams wrote:
> As a sidelight, one of the things that has always bothered me about subpop
> is that you are apparently never supposed to create an extract from your
> data, e.g. you could have 100 million cases and only be interested in a
> subpopulation of 10,000, but you are nonetheless supposed to keep all 100
> million cases in your data set so the standard errors are right. I always
> wonder how horrible it would be if you just made the extract or used -if-
> instead of subpop. If, say, the standard errors might be off by .01%, I
> suspect I could live with that.

If you have 100M cases, it is called a census ;).

See http://stata-journal.com/article.html?article=st0153. My
understanding of this (quite neat) article is that you are OK in a few
selected situations: when your subpop is a stratum or a union of
several strata, or when the subpop cuts through all PSUs (i.e., every
PSU has a member from the subpopulation, so subsetting with -if- does
not kill any sampling units). In those cases, subsetting the data with
-if- still produces design-consistent standard errors. Read the
article, though. If you have a design that's more complicated than the
standard one (stratified, two-stage clustered with replacement, as in
-webuse nhanes2-), things will get more complicated. The bottom line
is, YOU MUST HAVE OVERWHELMINGLY STRONG REASONS TO SUBSET YOUR DATA
WITH IF instead of using -subpop()-, which is always appropriate. Going
down from 100M observations to 10K observations is not a very
convincing reason to me, frankly.
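
One way to see whether a subpopulation "cuts through all PSUs" is to flag, for each PSU, whether it holds at least one subpopulation member. The lines below are only a rough sketch of that check, not something taken from the article. They assume the -nhanes2- design variables are named -strata- and -psu- (confirm with -svyset-); -has_sub- and -psu_tag- are made-up names for illustration.

```stata
webuse nhanes2, clear
* show the declared design; the variable names used below are assumed from it
svyset
* has_sub = 1 if the PSU has at least one diabetes==1 observation, 0 otherwise
* (observations with missing -diabetes- count as not in the subpopulation)
egen byte has_sub = max(diabetes == 1), by(strata psu)
* tag one observation per PSU so each PSU is counted once
egen byte psu_tag = tag(strata psu)
tabulate has_sub if psu_tag
```

If any PSU shows up with has_sub equal to 0, subsetting with -if- would drop that PSU entirely, which is exactly the situation the article warns about.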

The subset used for subpop is passed through (-passthru-ed?) in
e(subpop), so your predicted probabilities can be restricted to the
subpopulation with

predict whatever `e(subpop)' , options

or

predict whatever if `e(subpop)' , options

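depending on how the -subpop()- option was specified: the first form
when it was given as an -if- condition, the second when it was given
as a 0/1 variable. If you had no -subpop()-, then of course it will be
empty, so things should work out fine for you.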

webuse nhanes2, clear
svy : logit highbp age
* specify the subpop as the -if- condition
svy, subpop(if diabetes==1) : logit highbp age
est store logit1
predict prob1 `e(subpop)', pr
* specify the subpop as the 0/1 variable
svy, subpop(diabetes) : logit highbp age
est store logit2
predict prob2 if `e(subpop)', pr
* why the heck are they different??? Because -diabetes- has missing values, and -if diabetes- is true for them!
compare prob1 prob2
* is subsetting wrong here? It might be OK.
svy : logit highbp age if diabetes == 1
est store logit3
est tab logit1 logit2 logit3, se
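
To see where the difference between prob1 and prob2 comes from, it may help to look at what ends up in e(subpop) under each specification and at how -if- treats missing values. This is a small sketch along those lines; it only displays and counts, and I am not asserting what the counts will be.

```stata
webuse nhanes2, clear
* subpop given as an -if- condition
svy, subpop(if diabetes==1) : logit highbp age
display "e(subpop) is: `e(subpop)'"
* subpop given as a 0/1 variable
svy, subpop(diabetes) : logit highbp age
display "e(subpop) is: `e(subpop)'"
* a missing value is nonzero, so -if diabetes- is true for it
count if missing(diabetes)
count if diabetes == 1
count if diabetes
```

The last count exceeds the one before it by the number of observations with missing -diabetes-, and those are the observations that get a prediction in prob2 but not in prob1.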

P.S. I agree with Steve that this is the expected behavior of -svy-
and -e(sample)-, and I wouldn't want them to work otherwise.

-- 
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.