# Re: st: Clustering of secondary units in sampling design

From: Stas Kolenikov
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Clustering of secondary units in sampling design
Date: Mon, 5 Oct 2009 11:17:24 -0500

```
Stata uses an approximation to compute the variance estimator. You can
technically ignore the subsequent stages if you select PSUs with
replacement. As a rule, actual samples are drawn without
replacement to increase precision (through the finite population
correction). You can give Stata a more accurate design
specification if you know the unit sizes and can supply them with
the -fpc()- option; Stata will then treat the stages for which
-fpc()- is provided as sampled without replacement (WOR).

Also, there is a difference between variance and its estimator. For
instance, in two-stage sampling, you will have two terms in your
variance expression:

total variance = V1[E2(statistic|stage1)] + E1[V2(statistic|stage1)]

where the first term involves the cluster means/totals, the second
term involves the within-cluster variances, and the first term is
typically larger than the second. The variance estimator formula based
on the ultimate clusters at the first level wraps these two terms
together and shows only the variance of the cluster means. What you do in your
design work is to use the theoretical formula with the population
variances (known or accurately estimated from census or CPS data, for
instance). While the separate stage components may not show up in the
estimator of variance, you will see the differences between designs
if you run simulations. The two designs you are suggesting will
probably have variances that differ by somewhere between 2% and 10%,
and that is probably comparable with the bias of the variance
estimator. (This also means
that the approximation that Stata uses works better for some designs
than others; in their textbook, Korn & Graubard
<http://www.citeulike.org/user/ctacmo/article/553280> give some
numeric examples to show where the variance estimator based on the
ultimate clusters makes its hits and misses.)

2009/10/5 Ángel Rodríguez Laso <angelrlaso@gmail.com>:
> Dear Statalisters,
>
> When analysing multistage survey data, Stata only needs a Primary
> Sampling Unit variable in the dataset, because the contribution to
> variance of any further clustering is incorporated into the PSU
> variance. This is based on the 'ultimate cluster method' for variance
> calculation.
>
> Does it mean that, when designing a sample, the number of Secondary
> Sampling Units is irrelevant for calculating standard errors?
> That would mean that if, for example, one selects first municipalities
> (PSUs), then census tracts (SSUs) within municipalities and then
> individuals within census tracts, it is the same to go for a design
> with one census tract of 100 individuals per municipality as to go
> for 10 census tracts of 10 individuals each per municipality,
> although the second way increases the variability in the sample.

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

```
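
The suggestion in the reply to run simulations can be sketched roughly as follows. This is only an illustration under strong assumptions, not what Stata computes internally: it uses a simple three-level random-effects model (municipality, census tract, individual), treats every stage as sampled with replacement, and gives all units equal weight. The variance components (1.0, 0.05, 4.0), the 20 municipalities, and all function names are made up for the example. Under this model the variance of the overall mean is the sum of one term per stage, and the ultimate-cluster estimator is simply the variance of the PSU means divided by the number of PSUs.

```python
import random
import statistics


def design_variance(n_psu, n_ssu, n_ind, var_psu, var_ssu, var_ind):
    """Theoretical variance of the overall sample mean under a three-level
    random-effects model with all stages sampled with replacement."""
    return (var_psu / n_psu
            + var_ssu / (n_psu * n_ssu)
            + var_ind / (n_psu * n_ssu * n_ind))


def simulate_once(rng, n_psu, n_ssu, n_ind, sd_psu, sd_ssu, sd_ind):
    """Draw one sample; return (overall mean, ultimate-cluster variance estimate)."""
    psu_means = []
    for _ in range(n_psu):
        a = rng.gauss(0.0, sd_psu)          # municipality effect
        tract_means = []
        for _ in range(n_ssu):
            b = rng.gauss(0.0, sd_ssu)      # census tract effect
            ys = [a + b + rng.gauss(0.0, sd_ind) for _ in range(n_ind)]
            tract_means.append(statistics.fmean(ys))
        psu_means.append(statistics.fmean(tract_means))
    estimate = statistics.fmean(psu_means)
    # Ultimate-cluster estimator: only the PSU-level means are used.
    v_hat = statistics.variance(psu_means) / n_psu
    return estimate, v_hat


def monte_carlo(n_rep, seed, **design):
    """Return (simulated variance of the mean, average variance estimate)."""
    rng = random.Random(seed)
    draws = [simulate_once(rng, **design) for _ in range(n_rep)]
    means = [d[0] for d in draws]
    v_hats = [d[1] for d in draws]
    return statistics.variance(means), statistics.fmean(v_hats)


if __name__ == "__main__":
    # Hypothetical variance components, chosen only for illustration.
    comps = dict(sd_psu=1.0, sd_ssu=0.05 ** 0.5, sd_ind=2.0)
    for n_ssu, n_ind in [(1, 100), (10, 10)]:
        theory = design_variance(20, n_ssu, n_ind, 1.0, 0.05, 4.0)
        mc_var, mean_vhat = monte_carlo(200, 2009, n_psu=20, n_ssu=n_ssu,
                                        n_ind=n_ind, **comps)
        print(n_ssu, "tracts x", n_ind, "individuals:",
              "theory", theory, "simulated", mc_var, "mean v_hat", mean_vhat)
```

With these made-up components the theoretical variances are 0.0545 for one tract of 100 individuals and 0.05225 for 10 tracts of 10, a difference of about 4%. The estimator never sees the tract level directly, yet its average tracks the total variance of each design, which is the distinction the reply draws between the variance and its estimator. With only a few hundred replications the simulated figures are noisy, so the small gap between the two designs may need many more replications to show up clearly.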