Statalist The Stata Listserver


Re: st: svyset question

From   David Bell <[email protected]>
To   [email protected]
Subject   Re: st: svyset question
Date   Fri, 2 Jun 2006 15:47:54 -0400

Thank you, Roberto, for that very educational reply. I have run into a problem similar to Deborah's, although I'm not sure it is for the same reason.

What is the difference in your example, Roberto, if the clusters are of different sizes or if the number of observations within each cluster is variable essentially (or arguably) at random?

Stata seems to be focused on computing variances so that confidence intervals around point estimates can be estimated for the population. This would be quite important if I were a pollster, but I'm seldom really interested in population estimates for variables. In the causal models that concern me, I don't want the 5 members of cluster 1 to have 5 times the weight of the single member of cluster 2. I'm looking to use the design specification to adjust for these kinds of weighting issues.

Your example with exactly 2 observations in each cluster does not speak to this issue, because it assumes that if clusters are sampled with replacement and equal probability, then each subsample must also have equal probability. But when cluster sizes are uneven, the subsamples do not seem in fact to be equally probable. In my case, I was trying to specify a second stage in order to notify Stata of the overweighting of observations in larger clusters. When Stata informed me that my second-stage sampling design was unimportant, I felt like I was back in the SPSS days.

My memory of using SAS for clustered regression is that it was willing to let me specify a second stage design even if the first stage was sampled with equal (or unspecified) probability.

Dave Bell

On Jun 2, 2006, at 12:40 PM, Roberto G. Gutierrez, StataCorp wrote:

Holtzman, Deborah <[email protected]> asks:


"Help svyset" indicates that the following is possible:

Stratified two-stage design, individuals sampled in the second stage
. svyset su1 [pw=pw], strata(strata) || _n

This is exactly what I am trying to do. However, when I put in this
svyset command (using my own variable names), the results include the

"Note: stage 1 is sampled with replacement, all further stages will be
Deborah goes on to describe how the problem goes away when she specifies
a finite population correction at the first stage, but notes that an FPC is
not appropriate to her study.

Somewhat Long Answer

The -svyset- help file is a tad misleading in this example. Although it is
perfectly fine to type

. svyset su1 [pw=pw], strata(strata) || _n (1)

to specify a two-stage study, Stata will proceed to inform you that it
will ignore what you typed for the second stage and simply act as if you
had merely typed

. svyset su1 [pw=pw], strata(strata)

In essence, Stata is telling you that, because you didn't have an FPC at the
first stage, the sampling information at the second stage is irrelevant to
variance estimation. I'll explain this in more detail below.

When you do specify an FPC at the first stage, the sampling information for
the second stage does matter, and thus Stata will _not_ ignore it then.

Again, I'll reiterate that there is nothing inherently wrong with typing
what we did in (1) if indeed that was our design -- it's just that Stata
likes to emphasize the point that the second stage information is not needed.

In any case, it would probably be best if we changed the -svyset- help file
to use a first-stage FPC in this example.

Even Longer Answer

In the above I state the following fact: When you have a two-stage survey
design with no FPC at the first stage, the second stage sampling information
is irrelevant to the determination of correct variance estimates.

To explain why this is the case, I'll first remind you that not having an FPC
in a survey stage means that you are either sampling with replacement or that
you are sampling from an infinite population (or a population you would deem
infinite for all intents and purposes). When this occurs, you can think of
the sampling of PSUs (clusters) as a series of independent and identically
distributed draws. That is, what happens on the second cluster draw has
nothing to do with the first: the population from which I draw the second is
identical to the original, because the first cluster was either replaced or
the population is so big that removing the first cluster makes no difference.

With that in mind, consider the following two-stage survey, with sampling
with replacement at the first stage (no FPC). We'll sample without
replacement at the second stage, but whether we do does not affect the spirit
of this discussion.

Consider a population of N = 100 PSUs, with each PSU containing 5 people,
implying our total population size is 500. At the first stage, I will sample
10 PSUs _with_ replacement and then at the second stage I will sample, within
each of these 10 PSUs, 2 out of the 5 people without replacement. In the
end my data will consist of 10 PSUs, each of size 2 people, for a total of
20 observations. I could then -svyset- these data with something like

. svyset psu1 || _n, fpc(fpc2)

where -psu1- identifies the 10 PSUs and -fpc2- is a variable equal to 0.4
everywhere (each PSU had a second-stage sample of 2 out of 5, or 40%).

If I did this, however, Stata would act like I had merely typed

. svyset psu1

That is, all Stata cares about is the identification of the 10 first-stage
PSUs and does not care that I took 40% samples at the second stage. Why
does Stata do this, you might ask? To answer this, consider this alternative:

In the above population, I have 100 PSUs, each of size 5. Since there are
10 ways to pick 2 people out of 5, I can expand these 100 PSUs to form
100*10 = 1000 "new PSUs" (NPSUs), each of size 2, representing all possible
samples of size 2 from each of the 100 groups of 5. I now have a new
population of size 1000*2 = 2000 "new people" (each original person replicated
4 times). I now simply select 10 of these NPSUs with replacement, and end up
with a dataset consisting of 10 groups of 2 to form 20 observations, same as
before. I then -svyset- these data using

. svyset npsu1

where variable -npsu1- identifies the sampled NPSUs, and I'm set.

In point of fact, there is nothing from a sampling standpoint to distinguish
the second scenario from the first and, as such, Stata can just assume your
data came from the second scenario.
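The equivalence of the two scenarios can also be checked by brute force. Below is a Python simulation sketch (not Stata, and not how Stata computes anything internally; the population values are invented for the illustration): it draws many replicates under each design and compares the variance of the resulting sample means, which should agree up to Monte Carlo noise.

```python
import random
from itertools import combinations
from statistics import mean, pvariance

random.seed(1)

# Invented population: 100 PSUs of 5 people; person j in PSU i has value i + j.
N_PSU, M = 100, 5
psus = [[i + j for j in range(M)] for i in range(N_PSU)]

# All C(5,2) = 10 possible second-stage subsamples of each PSU: the 1000 "NPSUs".
npsus = [list(pair) for members in psus for pair in combinations(members, 2)]

def mean_scenario1():
    # Stage 1: 10 PSUs with replacement; stage 2: 2 of 5 without replacement.
    sample = []
    for _ in range(10):
        psu = random.choice(psus)
        sample.extend(random.sample(psu, 2))
    return mean(sample)

def mean_scenario2():
    # Directly draw 10 NPSUs (groups of 2) with replacement.
    sample = []
    for _ in range(10):
        sample.extend(random.choice(npsus))
    return mean(sample)

R = 4000
v1 = pvariance([mean_scenario1() for _ in range(R)])
v2 = pvariance([mean_scenario2() for _ in range(R)])
print(v1, v2, v1 / v2)  # the two variances agree up to Monte Carlo noise
```

Picking a PSU uniformly and then a pair uniformly within it is the same as picking one of the 1000 NPSUs uniformly, so the two sampling distributions are identical, not merely similar.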

Naturally, you may think:

1. The first population has 500 people, the second 2000. Of course there
is a difference.

This is so, but we are dealing with sampling with replacement (at the
first stage anyway) meaning that for all intents and purposes the
"population" is infinite. So it doesn't matter.

2. Doesn't the probability of being in the ultimate sample depend on
what happens on the first stage?

Not when you sample with replacement. Because you replace the first-stage
PSU when you are done with it, that means that any group of 2 from that
PSU (the one initially sampled, and all 9 others) are still in the running
to be in the sample. In effect, at every draw of a first-stage PSU, every
possible group of 2 people has the same chance of ending up in your sample.

3. Your second population has replicated people. Surely this is not a
valid population.

It is when you sample with replacement, in which case people can be in
your sample more than once. The act of "replicating" people simply
reflects this point.
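Point 2 can also be verified by exact enumeration rather than argument. The following Python sketch (an illustration, not Stata's internals) computes, for every possible group of 2, the probability that it is the outcome of a single first-stage draw: pick its PSU (1/100), then pick that particular pair among the C(5,2) = 10 pairs (1/10), giving 1/1000 for every group.

```python
from fractions import Fraction
from itertools import combinations

N_PSU, M = 100, 5
p_psu = Fraction(1, N_PSU)  # each PSU equally likely on any with-replacement draw

# Probability that a specific group of 2 is the outcome of one first-stage draw.
probs = []
for i in range(N_PSU):
    pairs = list(combinations(range(M), 2))  # the 10 possible pairs in this PSU
    for _ in pairs:
        probs.append(p_psu * Fraction(1, len(pairs)))

assert len(probs) == 1000
assert all(p == Fraction(1, 1000) for p in probs)
assert sum(probs) == 1
print("all 1000 groups equally probable:", probs[0])
```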

Of course, the above all breaks down when the first stage has sampling
_without_ replacement, in which case you would specify an FPC and Stata would
then need to listen to what you have to say about the second stage. After
all, that's why specific support for multi-stage studies is needed.

One example of how the above breaks down is that when you sample without
replacement at the first stage, only one possible subgroup from each primary
PSU can end up in your final sample. That is, take the first PSU of size 5 in
the population. It may be sampled in the first stage, it may not, but if it
is sampled, only _one_ of the possible 10 subsamples of size 2 will end up in
the final sample. This needs to be accounted for, in which case you need to
know that the second stage consists of a 40% sample.
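The breakdown can be seen numerically in a Python sketch (again an invented population, and only an illustration of the statistical point, not Stata's estimator). With stage 1 now sampled without replacement (f1 = 0.1), the naive with-replacement variance estimate, the sample variance of the 10 PSU means divided by 10, overstates the true variance of the estimator by roughly the factor f1; design information such as the 40% second-stage sample is needed to account for the difference.

```python
import random
from statistics import mean, pvariance, variance

random.seed(2)

# Invented population: 100 PSUs of 5 people; person j in PSU i has value i + j.
psus = [[i + j for j in range(5)] for i in range(100)]

def one_draw():
    # Stage 1: 10 PSUs WITHOUT replacement; stage 2: 2 of 5 without replacement.
    chosen = random.sample(psus, 10)
    psu_means = [mean(random.sample(psu, 2)) for psu in chosen]
    ybar = mean(psu_means)
    # Naive with-replacement variance estimate: s^2 of PSU means over n = 10.
    v_wr = variance(psu_means) / 10
    return ybar, v_wr

R = 8000
draws = [one_draw() for _ in range(R)]
true_var = pvariance([d[0] for d in draws])   # actual variance of ybar
avg_wr_est = mean([d[1] for d in draws])      # average naive estimate
print(true_var, avg_wr_est, true_var / avg_wr_est)
# the naive estimate overstates the true variance (by about f1 = 10% here)
```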

[email protected]
