# Re: Re: st: Clustering of secondary units in sampling design

 From Stas Kolenikov To statalist@hsphsun2.harvard.edu Subject Re: Re: st: Clustering of secondary units in sampling design Date Fri, 9 Oct 2009 09:36:25 -0500

```On Fri, Oct 9, 2009 at 4:04 AM,  <angelrlaso@gmail.com> wrote:
> "total variance = V1[E2(statistic|stage1)] + E1[V2(statistic|stage1)]"
>
> I don't know what V1 V2 E1 E2 stand for.

Oops, should've defined these better. E1/V1 are the expected value and
variance with respect to the first stage sampling, and E2/V2 are the
expected value and variance with respect to the second stage sampling,
conditional on the PSUs selected in the first stage. What book on
sampling do you have at hand? I will refer you to the specific
formula.

> "What you do in your
> design work is to use the theoretical formula with the population
> variances (known or accurately estimated from census or CPS data, for
> instance). While they may not show up in the estimator of variance,
> you will see the differences in different designs if you run
> simulations."
>
> If one doesn't see the difference in the estimator of variance, where does
> one see the difference when running simulations?
>
> I imagine the design with one census tract with 100 individuals will have
> less variance than the one with 10 census tracts with 10 individuals each.
> The first option should be corrected more strongly in the analysis because
> it is farther from a simple random sampling.

This is exactly what I would expect to see in the simulations.

Simulations with finite populations are somewhat tricky.

1. Create a large finite population, roughly of the size and structure
of the one you will be working with. Compute your parameters of
interest (true-theta: means, ratios, correlations, regression
parameters) -- those would be the targets of inference.

2. For r=1, ... R = #replications, repeat:

2a. Take a finite sample from your population according to your
design(s). That would be some sort of -sample- command in Stata,
although with several stages of selection, that's quite a bit more
tricky. I had to use a sequence of -sample- and -merge- when producing
my multiple stage samples.

2b. svyset your data according to the design (you may have to compute
probabilities of selection; 1/Prob[selection] = pweight).

2c. Estimate your parameters of interest; for each parameter, store
both its estimate theta-hat_r and its estimated standard error
se-theta-hat_r.

3. Once you are done with your sampling, compute the summaries.

3a. Bias[ theta-hat ] = average theta-hat - true-theta

Rationale: we know that even the means can be biased in designs with
random number of observations.

3b. Var[ theta-hat ] = 1/R times sum over r ( theta-hat_r - average
theta-hat )^2

3c. MSE[ theta-hat ] = 1/R times sum over r ( theta-hat_r - true-theta )^2

Rationale: we want to compare the designs in terms of efficiency.
Those are Monte Carlo approximations to the true design variance or
MSE.

3d. Bias of the variance estimator[ se-theta-hat ] = average
(se-theta-hat_r^2) - Var[ theta-hat ]

Rationale: we know that the variance estimators are almost always
biased. Even the direction of bias is not always known.

3e. Stability of the variance estimator [se-theta-hat] = 1/R times sum
over r ( se-theta-hat_r -sqrt{ MSE[ theta-hat ] } )^2 / MSE[ theta-hat
]

Rationale: linearization and the jackknife estimators tend to be more
stable than BRR or the bootstrap estimators. For the same amount of
bias, less stable estimators usually produce confidence intervals that
will lack coverage.

3f. Coverage = 1/R times sum over r of indicator { theta-hat_r + 1.96
se-theta-hat_r < true-theta}

This is coverage of the upper limit; define the coverage at the lower
limit and two-sided coverage accordingly. Indicator is the function
that takes value of 1 if the event in { } is true and 0 otherwise; in
Stata, you can simply write

gen byte coverage_UL = ( theta_hat + 1.96*se_theta_hat > true_theta )

(provided the quantities on the RHS are defined, of course). Then

sum coverage_UL

will give you the percentage of misses by the upper limit. Of course
1.96 here stands for the 95% normal CI; if you prefer to use
t-distribution, code something different.

Rationale: in the end, we want to use the standard errors to decide if
something is "significant" by conducting a formal test. We want to
know whether what we think to be a 10% test (and 90% confidence
interval) really produces 10% of the samples with type I error. In
practice, one would often see coverages like 8% on the upper limit and
3% on the lower limit that roughly balance to the two-sided 10%.

Those would be the (minimal set of) summaries I would keep track of in
my simulations. Performance of your estimators (both the point
estimates and standard errors) is a complicated function of both the
design and the particular estimation procedure. For instance, if you
are estimating a total of Y, you can have a direct estimator, the sum
of [Y*sampling weight], or you can compute a ratio estimator if you
have additional information and a reliable covariate (known total of X
/ sample total of X * sample total of Y). They will have different
properties in terms of bias and variance, and different estimators may
work better for some designs than for others. Likewise, to estimate
variances, you can use linearization or the jackknife or the
bootstrap, etc., depending on what your design allows you to do.

Now, the differences in the designs that you asked about should be
reflected in Var[ theta-hat ] and MSE[ theta-hat ]. As I said, you can
have the same approximation used in the variance estimator for
different designs, but it does not mean that the underlying variances
being estimated are the same; they most likely aren't. I would too
think that the design with SRS at SSU level would produce smaller
variance than a design with an additional sampling stage. But what I
think will happen is that the difference in those variances will be of
the same order of magnitude as the amount of error you make by using
the particular variance esimator:

abs( MSE[ theta-hat; design 1] - MSE[ theta-hat; design 2] )
approximately equal or less than abs( Bias[ se-theta-hat ; design 1 ]
) approximately equal abs( Bias[ se-theta-hat; design 2] )

where I explicitly add dependence on the design to my simulation summaries.

Hope this clarifies things a little bit. (Should I write a SJ article
on this? :))

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```