Holtzman, Deborah <DHoltzman@air.org> asks:
Hi,
"Help svyset" indicates that the following is possible:
Stratified two-stage design, individuals sampled in the second stage
. svyset su1 [pw=pw], strata(strata) || _n
This is exactly what I am trying to do. However, when I put in this
svyset command (using my own variable names), the results include the
following:
"Note: stage 1 is sampled with replacement, all further stages
will be ignored"
Deborah goes on to describe how the problem goes away when she
specifies a finite population correction (FPC) at the first stage, but
notes that an FPC is not appropriate to her study.
Somewhat Long Answer
--------------------
The -svyset- help file is a tad misleading in this example. Although
it is perfectly fine to type

. svyset su1 [pw=pw], strata(strata) || _n                        (1)

to specify a two-stage design, Stata will inform you that it is
ignoring what you typed for the second stage and will act as if you
had merely typed

. svyset su1 [pw=pw], strata(strata)

In essence, Stata is telling you that, because you did not specify an
FPC at the first stage, the sampling information at the second stage
is irrelevant to variance estimation. I'll explain this in more detail
below.

When you do specify an FPC at the first stage, the sampling
information for the second stage does matter, and thus Stata will
_not_ ignore it.

To reiterate, there is nothing inherently wrong with typing (1) if
that is indeed your design; Stata simply likes to emphasize that the
second-stage information is not needed.

In any case, it would probably be best if we changed the -svyset- help
file to use a first-stage FPC in this example.
Even Longer Answer
------------------
In the above I stated the following fact: when you have a two-stage
survey design with no FPC at the first stage, the second-stage
sampling information is irrelevant to the determination of correct
variance estimates.

To explain why this is the case, I'll first remind you that not having
an FPC in a survey stage means that you are either sampling with
replacement or sampling from an infinite population (or a population
you would deem infinite for all intents and purposes). When this
occurs, you can think of the sampling of PSUs (clusters) as a series
of independent and identically distributed draws. That is, what
happens on the second cluster draw has nothing to do with the first:
the population from which I draw the second cluster is identical to
the original, either because the first cluster was replaced or because
the population is so big that removing the first cluster makes no
difference.
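
To make the consequence of independent PSU draws concrete, here is a
minimal sketch in Python (not Stata) of the textbook with-replacement
variance estimator for a weighted total in a single stratum. It is an
illustration under those assumptions, not a claim about Stata's
internal code; the function name wr_total_variance and the toy data
are made up. The point to notice is that only PSU membership enters
the calculation, and nothing about how people were chosen within a
PSU.

```python
def wr_total_variance(psu_ids, weights, values):
    """With-replacement variance estimate of a weighted total.

    Assumes one stratum and no first-stage FPC. Only the weighted
    PSU totals are used; second-stage details never appear.
    """
    totals = {}
    for psu, w, y in zip(psu_ids, weights, values):
        totals[psu] = totals.get(psu, 0.0) + w * y
    n = len(totals)
    if n < 2:
        raise ValueError("need at least two PSUs")
    mean = sum(totals.values()) / n
    # n/(n-1) times the sum of squared deviations of the PSU totals
    return n / (n - 1) * sum((z - mean) ** 2 for z in totals.values())


# Toy data (made up): 3 PSUs with 2 sampled people each, weights of 1.
psu = [1, 1, 2, 2, 3, 3]
w = [1.0] * 6
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
v = wr_total_variance(psu, w, y)
print(v)
```

Changing how the two people inside each PSU were selected would change
the data you observe, but not the formula applied to them.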
With that in mind, consider the following two-stage survey, with
sampling with replacement at the first stage (no FPC). We'll sample
without replacement at the second stage, but whether we do does not
affect the spirit of this discussion.

Consider a population of N = 100 PSUs, with each PSU containing 5
people, implying a total population size of 500. At the first stage, I
will sample 10 PSUs _with_ replacement, and then at the second stage I
will sample, within each of these 10 PSUs, 2 out of the 5 people
without replacement. In the end my data will consist of 10 PSUs, each
of size 2, for a total of 20 observations. I could then -svyset- these
data with something like

. svyset psu1 || _n, fpc(fpc2)

where -psu1- identifies the 10 PSUs and -fpc2- is a variable equal to
0.4 everywhere (each PSU had a second-stage sample of 2 out of 5, or
40%). If I did this, however, Stata would act as if I had merely typed

. svyset psu1

That is, all Stata cares about is the identification of the 10
first-stage PSUs; it does not care that I took 40% samples at the
second stage. Why does Stata do this, you might ask? To answer this,
consider the following alternative scenario.

In the above population, I have 100 PSUs, each of size 5. Since there
are 10 ways to pick 2 people out of 5, I can expand these 100 PSUs to
form 100*10 = 1000 "new PSUs" (NPSUs), each of size 2, representing
all possible samples of size 2 from each of the 100 groups of 5. I now
have a new population of size 1000*2 = 2000 "new people" (each
original person replicated 4 times). I now simply select 10 of these
NPSUs with replacement and end up with a dataset consisting of 10
groups of 2, for 20 observations, the same as before. I then -svyset-
these data using

. svyset npsu1

where the variable -npsu1- identifies the sampled NPSUs, and I'm set.

In point of fact, there is nothing from a sampling standpoint to
distinguish the second scenario from the first and, as such, Stata can
simply assume your data came from the second scenario.
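
The counting behind the two scenarios can be checked with a short
Python sketch (Python rather than Stata, purely for illustration).
The labels used for people and the variable names are assumptions of
the sketch. It builds the expanded population of NPSUs and compares
the per-draw chance of obtaining one particular pair of people under
each scenario.

```python
from fractions import Fraction
from itertools import combinations

N_PSU = 100      # PSUs in the population
PSU_SIZE = 5     # people per PSU
M = 2            # people sampled per selected PSU

# Label each person as (psu, position within psu).
population = {p: [(p, k) for k in range(PSU_SIZE)] for p in range(N_PSU)}

# Scenario 2: every possible pair within every PSU becomes a "new PSU".
npsus = [pair for people in population.values()
         for pair in combinations(people, M)]

# Scenario 1: per-draw chance of one particular pair =
# P(pick its PSU) * P(pick that pair within the PSU).
pairs_per_psu = len(list(combinations(range(PSU_SIZE), M)))
p_scenario1 = Fraction(1, N_PSU) * Fraction(1, pairs_per_psu)

# Scenario 2: per-draw chance of one particular NPSU.
p_scenario2 = Fraction(1, len(npsus))

print(pairs_per_psu, len(npsus), len(npsus) * M)
print(p_scenario1 == p_scenario2)
```

Both scenarios give each pair a 1 in 1000 chance on every draw, which
is why the resulting data cannot tell them apart.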
Naturally, you may think:

1. The first population has 500 people, the second 2000. Of course
   there is a difference.

   This is so, but we are dealing with sampling with replacement (at
   the first stage anyway), meaning that for all intents and purposes
   the "population" is infinite. So it doesn't matter.

2. Doesn't the probability of being in the ultimate sample depend on
   what happens at the first stage?

   Not when you sample with replacement. Because you replace the
   first-stage PSU when you are done with it, any group of 2 from
   that PSU (the one initially sampled and all 9 others) is still in
   the running to be in the sample. In effect, at every draw of a
   first-stage PSU, every possible group of 2 people has the same
   chance of ending up in your sample.

3. Your second population has replicated people. Surely this is not a
   valid population.

   It is when you sample with replacement, in which case people can
   be in your sample more than once. The act of "replicating" people
   simply reflects this point.
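
The replication count in objection 3 and the probability in objection
2 can both be verified by brute force. The following Python sketch
(again only an illustration; the variable names are made up) counts
how many of the 10 pairs contain each person in one PSU of 5, then
compares a person's per-draw selection probability under the original
two-stage design with the same probability in the expanded
population.

```python
from fractions import Fraction
from itertools import combinations

people = range(5)                      # one PSU of 5 people
pairs = list(combinations(people, 2))  # its 10 possible second-stage samples

# How many of the 10 pairs contain each person (the "replication" count).
copies = {p: sum(p in pair for pair in pairs) for p in people}

# Per-draw chance that person 0 of a given PSU is selected.
two_stage = Fraction(1, 100) * Fraction(2, 5)     # original design
expanded = Fraction(copies[0], 100 * len(pairs))  # expanded population

print(copies)
print(two_stage, expanded)
```

Each person shows up in 4 of the 10 pairs, and the two probabilities
agree, so replicating people changes nothing about anyone's chance of
selection on a given draw.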
Of course, the above all breaks down when the first stage is sampled
_without_ replacement, in which case you would specify an FPC and
Stata would then need to listen to what you have to say about the
second stage. After all, that's why specific support for multistage
designs is needed.

One example of how the above breaks down is that when you sample
without replacement at the first stage, only one possible subgroup
from each PSU can end up in your final sample. That is, take the first
PSU of size 5 in the population. It may be sampled in the first stage
or it may not, but if it is sampled, only _one_ of the 10 possible
subsamples of size 2 will end up in the final sample. This needs to be
accounted for, and to do so you need to know that the second stage
consists of a 40% sample.
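
As a rough illustration of that difference, the Python sketch below
contrasts first-stage draws with and without replacement for the
100-PSU example. The helper names draw_psus and prob_repeat and the
seed are assumptions of the sketch. Without replacement no PSU can
appear twice, so at most one of its 10 possible pairs can reach the
sample; with replacement a repeated PSU is a fairly common event.

```python
import random
from collections import Counter


def draw_psus(n_pop, n_draws, replace, rng):
    """Draw first-stage PSU labels with or without replacement."""
    if replace:
        return rng.choices(range(n_pop), k=n_draws)
    return rng.sample(range(n_pop), n_draws)


def prob_repeat(n_pop, n_draws):
    """Exact chance that some PSU is drawn more than once (with replacement)."""
    p_all_distinct = 1.0
    for k in range(n_draws):
        p_all_distinct *= (n_pop - k) / n_pop
    return 1.0 - p_all_distinct


rng = random.Random(12345)  # arbitrary seed, only for repeatability
without = draw_psus(100, 10, replace=False, rng=rng)

# Without replacement every sampled PSU appears exactly once.
print(max(Counter(without).values()))
# With replacement, the chance of at least one repeated PSU in 10 draws.
print(prob_repeat(100, 10))
```

For 10 draws from 100 PSUs the chance of at least one repeat with
replacement is a bit over one in three, whereas without replacement it
is zero by construction.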
--Bobby
rgutierrez@stata.com