After spending a long time tracking down a problem in some output I was
getting, I have discovered something counter-intuitive about how e(sample)
is defined after survey estimation commands.
When survey estimation commands are used with the subpop() option,
e(sample) does not take this into account. In particular, after
    svyregress ...., subpop(touse)
the command
    assert touse if e(sample)
will, in general, fail.
It appears, rather, that e(sample) is defined exactly as it would be if the
subpop option were not invoked: it marks all observations which contain
complete data on the regression variables, regardless of whether they are
in the subpopulation or not.
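A quick way to see this in your own data is sketched below. It is only an
illustration under assumptions: the data are taken to be already svyset,
y, x1 and x2 are hypothetical variable names, touse is a 0/1 subpopulation
indicator, and the newer svy prefix syntax is used in place of svyregress.

```stata
* Assumption: data already svyset; y, x1, x2 are hypothetical variables;
* touse is a 0/1 indicator marking the subpopulation.
svy, subpop(touse): regress y x1 x2

* Copy e(sample) into a variable and cross-tabulate it against touse.
generate byte insample = e(sample)
tabulate insample touse, missing

* Count observations flagged by e(sample) that lie outside the subpopulation.
* A nonzero count is the behavior described above.
count if e(sample) & touse != 1

* Stored counts (as I understand recent versions of svy):
display e(N)       // observations flagged by e(sample)
display e(N_sub)   // observations in the subpopulation
```

If the count is nonzero, then assert touse if e(sample) would fail on
those same observations.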
The implication is that a subsequent command like
    predict ... if e(sample)
will not give you what you might expect: it also predicts for observations
outside the subpopulation, i.e., it does out-of-sample prediction!
To get an in-sample prediction requires
    predict ... if e(sample) & touse==1
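One way to avoid retyping the combined condition is to store it once in a
marker variable right after estimation. This is just a sketch: insub and
yhat are hypothetical names, and it assumes the svy model with
subpop(touse) has just been fit.

```stata
* Assumption: run immediately after the svy estimation with subpop(touse).
* insub marks observations that are in e(sample) AND in the subpopulation.
generate byte insub = e(sample) & touse == 1

* Linear prediction restricted to the in-sample subpopulation observations.
predict double yhat if insub

* The same marker can then be reused for graphs, summaries, and so on.
summarize yhat if insub
```

Because touse == 1 evaluates to false when touse is missing, observations
with a missing subpopulation indicator are excluded from insub as well.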
Similar considerations apply to graphing and any other non-svy operations
that one might want to apply only in-sample.
I imagine that e(sample) is set up this way because in survey estimation,
unlike other estimation commands, observations outside the subpop() _do_
contribute to the calculation of e(V) because of design effects. From this
perspective, this implementation of e(sample) makes sense.
Obviously the workaround is simple enough, but the behavior of e(sample)
seemed counter-intuitive at first and took me by surprise. I'm posting
this not so much as a complaint as to alert others who might trip over the
same phenomenon.
Pax vobiscum.
Clyde Schechter
Dept. of Family Medicine & Community Health
Albert Einstein College of Medicine
Bronx, NY, USA