

From: Stas Kolenikov <skolenik@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: svy subpop option and e(sample)
Date: Wed, 25 May 2011 11:35:15 -0500

On Wed, May 25, 2011 at 10:10 AM, Richard Williams <richardwilliams.ndu@gmail.com> wrote:
> As a sidelight, one of the things that has always bothered me about subpop
> is that you are apparently never supposed to create an extract from your
> data, e.g. you could have 100 million cases and only be interested in a
> subpopulation of 10,000, but you are nonetheless supposed to keep all 100
> million cases in your data set so the standard errors are right. I always
> wonder how horrible it would be if you just made the extract or used -if-
> instead of subpop. If, say, the standard errors might be off by .01%, I
> suspect I could live with that.

If you have 100M cases, it is called a census ;).

See http://stata-journal.com/article.html?article=st0153. My understanding of this (quite neat) article is that you are OK in a few selected situations: when your subpop == a stratum or a union of several strata, or when subpop cuts through all PSUs (i.e., every PSU has a member from the subpopulation, so subsetting with -if- does not kill any sampling units). That way, subsetting the data by -if- still produces design-consistent standard errors. Read the article, though. If you have a design that's more complicated than the standardized one (stratified, two-stage clustered with replacement, as -webuse nhanes2- is), things will get more complicated.

The bottom line is, YOU MUST HAVE OVERWHELMINGLY STRONG REASONS TO SUBSET YOUR DATA WITH IF instead of using -subpop()-, which is always appropriate. Going down from 100M observations to 10K observations is not a very convincing reason to me, frankly.

The subset used for subpop is passed through (-passthru-ed?) in e(subpop), so your predicted probabilities can be restricted to the subpopulation with

predict whatever `e(subpop)' , options

or

predict whatever if `e(subpop)' , options

depending on how the -subpop()- option was specified. If you had no -subpop()-, then of course it will be empty, so things should work out fine for you.
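The "subpop cuts through all PSUs" condition above can be checked directly in the data. What follows is only a rough sketch on -nhanes2-, not anything taken from the article: it assumes a single-stage -svyset- whose PSU and strata variables can be read back from r(su1) and r(strata1), and the names insub, n_insub and psutag are made up for illustration. A zero count tells you only that no PSU would be dropped by -if-; it does not by itself establish that subsetting is safe for a more complicated design.

```stata
webuse nhanes2, clear

* read the design variables back from the stored survey settings
quietly svyset
local psu    `r(su1)'
local strata `r(strata1)'

* 0/1 subpopulation flag; missing diabetes is treated as "not in subpop"
generate byte insub = (diabetes == 1)

* number of subpopulation members within each stratum-by-PSU cell
bysort `strata' `psu': egen n_insub = total(insub)

* tag one observation per PSU and count PSUs with no subpop member
egen byte psutag = tag(`strata' `psu')
count if psutag & n_insub == 0
display "PSUs without a subpopulation member: " r(N)
```

If the count is positive, subsetting with -if- would drop whole sampling units, which is exactly the situation -subpop()- is designed to handle.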
webuse nhanes2, clear
svy : logit highbp age

* specify the subpop as the -if- condition
svy, subpop(if diabetes==1) : logit highbp age
est store logit1
predict prob1 `e(subpop)', pr

* specify the subpop as the 0/1 variable
svy, subpop(diabetes) : logit highbp age
est store logit2
predict prob2 if `e(subpop)', pr

* why the heck are they different??? Because -diabetes- has missing values!
compare prob1 prob2

* is subsetting wrong here? It might be OK.
svy : logit highbp age if diabetes == 1
est store logit3
est tab logit1 logit2 logit3, se

P.S. I agree with Steve that this is the expected behavior of -svy- and -e(sample)-, and I wouldn't want them to work otherwise.

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

