Re: st: svyset question
Thank you, Roberto, for that very educational reply. I have had a
similar consternation to Deborah's, although I'm not sure it is for
the same reason.
What is the difference in your example, Roberto, if the clusters are
of different sizes or if the number of observations within each
cluster is variable essentially (or arguably) at random?
Stata seems to be interested in computing variances so that the confidence
intervals around point estimates can be estimated for the
population. This would be quite important if I were a pollster, but
I'm seldom really interested in population estimates for variables.
In the causal models that concern me, I don't want the 5 members of
cluster 1 to have 5 times the weight of the single member of cluster
2. I'm looking to use the design specification to adjust for these
kinds of weighting issues.
Your example with exactly 2 observations in each cluster is not
relevant for this issue because it assumes that, because clusters are
sampled with replacement and equal probability, each subsample
must have equal probability. But in the uneven cluster size case,
the subsamples do not in fact seem to be equally probable. In my
case, I was trying to specify a second stage in order to
notify Stata of the overweighting of observations in larger
clusters. When Stata informed me that my second stage sampling
design was unimportant, I felt like I was back in the SPSS days.
My memory of using SAS for clustered regression is that it was
willing to let me specify a second stage design even if the first
stage was sampled with equal (or unspecified) probability.
On Jun 2, 2006, at 12:40 PM, Roberto G. Gutierrez, StataCorp wrote:
Holtzman, Deborah <DHoltzman@air.org> asks:
"Help svyset" indicates that the following is possible:
Stratified two-stage design, individuals sampled in the second stage
. svyset su1 [pw=pw], strata(strata) || _n
This is exactly what I am trying to do. However, when I put in this
svyset command (using my own variable names), the results include the
following:
"Note: stage 1 is sampled with replacement, all further stages
will be ignored"
Deborah goes on to describe how the problem goes away when she
specifies a finite population correction at the first stage, but that
an FPC is not appropriate to her study.
Somewhat Long Answer
The help file -svyset- is a tad misleading in this example.
Although it is perfectly fine to type
. svyset su1 [pw=pw], strata(strata) || _n          (1)
to specify a two-stage study, Stata will proceed to inform you that it
will ignore what you typed for the second stage and simply act as
if you had merely typed
. svyset su1 [pw=pw], strata(strata)
In essence, Stata is telling you that, because you didn't have an
FPC at the first stage, the sampling information at the second stage
is irrelevant to variance estimation. I'll explain this in more
detail below.
When you do specify an FPC at the first stage, the sampling at
the second stage does matter, and thus Stata will _not_ ignore it.
Again, I'll reiterate that there is nothing inherently wrong with
what we did in (1) if indeed that was our design -- it's just that
Stata likes to emphasize the point that the second stage information
is ignored.
In any case, it would probably be best if we changed the -svyset-
help file to use a first-stage FPC in this example.
Even Longer Answer
In the above I state the following fact: When you have a two-stage
design with no FPC at the first stage, the second stage sampling
information is irrelevant to the determination of correct variance
estimates.
To explain why this is the case, I'll first remind you that not
having an FPC in a survey stage means that you are either sampling
with replacement or that you are sampling from an infinite population
(or a population you consider infinite for all intents and purposes).
When this occurs, you can treat the sampling of PSUs (clusters) as a
series of independent and identically distributed draws. That is,
what happens on the second cluster has nothing to do with the first,
because the population from which I draw the second is identical to
the original: either the first cluster was replaced, or the
population is so big that the deletion of the first cluster makes no
difference.
With that in mind, consider the following two-stage survey, with
sampling with replacement at the first stage (no FPC). We'll sample
without replacement at the second stage, but whether we do does not
affect the point of this discussion.
Consider a population of N = 100 PSUs, with each PSU containing 5
people, implying our total population size is 500. At the first
stage, I will sample 10 PSUs _with_ replacement and then at the
second stage I will sample, from each of these 10 PSUs, 2 out of the
5 people without replacement. In the end my data will consist of 10
PSUs, each of size 2 people, for a total of twenty observations. I
could then -svyset- these data with
. svyset psu1 || _n, fpc(fpc2)
where -psu1- identifies the 10 PSUs and -fpc2- is a variable that is
constant everywhere (each PSU had a second-stage 2 out of 5, or 40%,
sample).
If I did this, however, Stata would act like I had merely typed
. svyset psu1
That is, all Stata cares about is the identification of the 10
PSUs and does not care that I took 40% samples at the second
stage. Why does Stata do this, you might ask? To answer this,
consider the following:
In the above population, I have 100 PSUs, each of size 5. Since
there are 10 ways to pick 2 people out of five, I can expand these
100 PSUs into 100*10 = 1000 "new PSUs" (NPSUs), each of size 2,
representing all possible samples of size 2 from each of the 100
groups of 5. I now have a new population of size 1000*2 = 2000 "new
people" (each original person appears 4 times). I now simply select
10 of these NPSUs with replacement, and end up with a dataset
consisting of 10 groups of 2 to form 20 observations, same as
before. I then -svyset- these data using
. svyset npsu1
where variable -npsu1- identifies the sampled NPSUs, and I'm set.
In point of fact, there is nothing from a sampling standpoint to
distinguish the second scenario from the first and, as such, Stata
can just assume the data came from the second scenario.
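
The counting behind the NPSU construction can be checked directly. What follows is only a sketch in Python (not Stata), and every name in it (N_PSUS, PSU_SIZE, SUBSAMPLE, npsus, and the (psu, member) labelling of people) is made up for illustration; nothing here is part of -svyset-.

```python
from itertools import combinations
from collections import Counter

N_PSUS = 100      # PSUs in the population (from the example above)
PSU_SIZE = 5      # people per PSU
SUBSAMPLE = 2     # people taken per sampled PSU at the second stage

# Hypothetical labelling: each person is a (psu, member) pair.
population = [(p, m) for p in range(N_PSUS) for m in range(PSU_SIZE)]

# Every possible second-stage subsample of every PSU becomes one "new PSU".
npsus = [
    tuple((p, m) for m in members)
    for p in range(N_PSUS)
    for members in combinations(range(PSU_SIZE), SUBSAMPLE)
]

# How many NPSUs does each original person appear in?
appearances = Counter(person for npsu in npsus for person in npsu)

print(len(population))               # 500 original people
print(len(npsus))                    # 1000 NPSUs
print(sum(len(n) for n in npsus))    # 2000 "new people"
print(set(appearances.values()))     # {4}: every person is replicated 4 times
```

Each person shows up 4 times because, once that person is in a subsample of size 2, there are 4 choices for the other member.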
Naturally, you may think:
1. The first population has 500 people, the second 2000. Of
course there is a difference.
This is so, but we are dealing with sampling with replacement (at the
first stage anyway), meaning that for all intents and purposes the
"population" is infinite. So it doesn't matter.
2. Doesn't the probability of being in the ultimate sample depend on
what happens on the first stage?
Not when you sample with replacement. Because you replace the
PSU when you are done with it, any group of 2 from any
PSU (the one initially sampled, and all 99 others) is still in
the running to be in the sample. In effect, at every draw of a
first-stage PSU, every possible group of 2 people has the same chance
of ending up in the sample.
3. Your second population has replicated people. Surely this is
not valid.
It is when you sample with replacement, in which case people can be
in your sample more than once. The act of "replicating" people
reflects this point.
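
The claim in point 2, that at every first-stage draw each possible group of 2 has the same chance of being selected under either scenario, can be verified with exact fractions. This is a sketch only, in Python rather than Stata, and it assumes (as the example seems to) that PSUs are drawn with equal probability and that the second stage is a simple random sample within the PSU. The variable names are hypothetical.

```python
from fractions import Fraction
from math import comb

N_PSUS = 100     # PSUs in the population
PSU_SIZE = 5     # people per PSU
SUBSAMPLE = 2    # people taken per sampled PSU

# Scenario 1: draw a PSU, then draw 2 of its 5 people without replacement.
# Assumption: equal-probability PSU draws, simple random sampling within PSU.
p_psu = Fraction(1, N_PSUS)
p_pair_given_psu = Fraction(1, comb(PSU_SIZE, SUBSAMPLE))
p_two_stage = p_psu * p_pair_given_psu

# Scenario 2: draw one of the 100 * 10 = 1000 NPSUs directly.
p_one_stage = Fraction(1, N_PSUS * comb(PSU_SIZE, SUBSAMPLE))

print(p_two_stage, p_one_stage)   # 1/1000 1/1000
```

Because the draws are independent when sampling with replacement, matching the per-draw probabilities is enough to match the two sampling schemes.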
Of course, the above all breaks down when the first stage has sampling
_without_ replacement, in which case you would specify an FPC and
Stata would then need to listen to what you have to say about the
second stage. After all, that's why specific support for multi-stage
studies is needed.
One example of how the above breaks down is that when you sample
without replacement at the first stage, only one possible subgroup
from each PSU can end up in your final sample. That is, take the
first PSU of size 5 in the population. It may be sampled in the first
stage, it may not, but if it is sampled, only _one_ of the possible
10 subsamples of size 2 will end up in the final sample. This needs
to be accounted for, in which case you need to know that the second
stage consists of a 40% sample.
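
To see how different the two first-stage schemes are in this example, one can compute the chance that some PSU is drawn more than once among the 10 first-stage draws. Again this is just a sketch in Python with hypothetical names, assuming equal-probability draws from the 100 PSUs.

```python
from fractions import Fraction

N_PSUS = 100    # PSUs in the population
N_DRAWS = 10    # PSUs drawn at the first stage

# With replacement: probability that all 10 draws hit distinct PSUs.
p_all_distinct = Fraction(1)
for i in range(N_DRAWS):
    p_all_distinct *= Fraction(N_PSUS - i, N_PSUS)

# Probability that at least one PSU appears more than once.
p_repeat_with_replacement = 1 - p_all_distinct

# Without replacement a PSU can never be drawn twice, so at most one
# of its 10 possible subsamples can reach the final sample.
p_repeat_without_replacement = Fraction(0)

print(float(p_repeat_with_replacement))      # about 0.37
print(float(p_repeat_without_replacement))   # 0.0
```

With replacement, then, a repeated PSU (and hence two different subsamples from the same PSU) is far from a rare event here, which is exactly what the NPSU scenario allows and what sampling without replacement rules out.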