# Re: st: cluster and F test

From: "Stas Kolenikov"
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: cluster and F test
Date: Tue, 15 Jul 2008 09:49:28 -0500

On 7/11/08, Ángel Rodríguez Laso <angelrlaso@gmail.com> wrote:
>  So the standard error is calculated on the effective sample size (16648;
>  p(1-p)/se*se) that, if corrected by deft*deft becomes
>  (16648*0.855246*0.855246) 12177, much closer to the number of
>  observations than to the number of clusters.

That's not how the standard errors and design effects are computed.
The primary computation is the variance of the estimates, using a
svy-appropriate estimator (in your case Taylor series linearization,
which is also what the -robust- and -cluster- options produce in
non-svy commands). The design effects are then computed as a secondary
measure, comparing the resulting standard errors to the "ideal" ones
that would be obtained from an SRS sample of the same size.
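To make the order of computation concrete, here is a minimal sketch in Python (not Stata, and not a reproduction of what -svy- does internally). It assumes a single stratum, equal weights, clusters sampled with replacement, and an unweighted mean as the estimate; the function name and the data are made up for illustration.

```python
from math import sqrt

def mean_deff(clusters):
    """Design effect for an unweighted mean.

    `clusters` is a list of lists of observations. Assumes one stratum,
    equal weights, and clusters sampled with replacement.
    """
    n = sum(len(c) for c in clusters)        # total number of observations
    n_clusters = len(clusters)
    ybar = sum(sum(c) for c in clusters) / n

    # Step 1 (primary): linearization variance, built from cluster totals
    # of the linearized values z_i = (y_i - ybar) / n. The totals sum to
    # zero, so no further centering is needed.
    totals = [sum(y - ybar for y in c) / n for c in clusters]
    var_design = n_clusters / (n_clusters - 1) * sum(t * t for t in totals)

    # Step 2 (secondary): the variance an SRS of the same size would give.
    s2 = sum((y - ybar) ** 2 for c in clusters for y in c) / (n - 1)
    var_srs = s2 / n

    deff = var_design / var_srs
    return ybar, sqrt(var_design), deff, sqrt(deff)

# Hypothetical data: four clusters, each perfectly homogeneous.
data = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
ybar, se, deff, deft = mean_deff(data)
print(ybar, se, deff, deft)
```

With these made-up data the design effect comes out at 5 (up to floating-point rounding): the 16 observations carry far less information than 16 independent ones would. The standard error is obtained first, and deff and deft are derived from it, not the other way around.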

As Steven said, in cluster designs, there is no single total sample
size that you are using to achieve a given power or precision of your
estimates. For a sample size of 1000, you might have 40 clusters with
25 observations, or 500 clusters with 2 observations, or 10 clusters
with 50 observations plus another 5 clusters with 100 observations.
They will all likely give quite different design effects. I'd
encourage you to think about minimizing the variance Steven gives at
the end of his email with respect to n and m, given the (average)
within and between variances S_w^2 and S_b^2 that you need to know.
The same thinking applies to proportions, where observations are 0s
and 1s.
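As a rough illustration of how differently those three designs behave, here is a sketch in Python. It assumes a one-way random-effects model, under which the unweighted overall mean has variance S_b^2 * sum(m_i^2) / N^2 + S_w^2 / N, where m_i are the cluster sizes and N is the total sample size. With n clusters of equal size m this reduces to S_b^2/n + S_w^2/(n*m). That formula is my assumption for the sketch and may differ from the expression in Steven's email; the variance values are invented.

```python
def mean_variance(cluster_sizes, s_b2, s_w2):
    """Model-based variance of the unweighted overall mean, and its deff.

    Assumes a one-way random-effects model with between-cluster variance
    s_b2 and within-cluster variance s_w2.
    """
    n_obs = sum(cluster_sizes)
    sum_sq = sum(m * m for m in cluster_sizes)
    var_clustered = s_b2 * sum_sq / n_obs**2 + s_w2 / n_obs
    var_srs = (s_b2 + s_w2) / n_obs          # same total size, no clustering
    return var_clustered, var_clustered / var_srs

# Illustrative values only: intraclass correlation of 0.05.
S_B2, S_W2 = 0.05, 0.95

designs = {
    "40 x 25": [25] * 40,
    "500 x 2": [2] * 500,
    "10 x 50 + 5 x 100": [50] * 10 + [100] * 5,
}

deffs = {}
for label, sizes in designs.items():
    var, deff = mean_variance(sizes, S_B2, S_W2)
    deffs[label] = deff
    print(f"{label:>18}: n = {sum(sizes)}, var = {var:.5f}, deff = {deff:.2f}")
```

All three designs have 1000 observations, yet under these assumptions the design effects are about 2.2, 1.05 and 4.7, so a single "total sample size" tells you little about precision.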

Going back to the issue of the degrees of freedom of the design --
this is a somewhat complicated issue. #clusters - #strata (or
#clusters - 1 for unstratified designs) is the most conservative
approximation to the degrees of freedom of a design.
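The counting rule itself is simple enough to sketch in Python. The function name and the data layout (one (stratum, cluster) label pair per observation) are hypothetical choices for illustration only.

```python
def design_df(psu_strata):
    """Conservative design degrees of freedom: #clusters - #strata.

    `psu_strata` is an iterable of (stratum, cluster) labels, one pair
    per observation. Clusters are counted within strata.
    """
    pairs = set(psu_strata)                  # distinct clusters
    strata = {s for s, _ in pairs}           # distinct strata
    return len(pairs) - len(strata)

# Hypothetical design: 3 strata with 2, 3 and 4 clusters, 50 obs each.
obs = [(s, c)
       for s, k in enumerate([2, 3, 4])
       for c in range(k)
       for _ in range(50)]
df = design_df(obs)
print(len(obs), df)
```

Here 450 observations yield only 9 - 3 = 6 design degrees of freedom. If every observation carries the same stratum label, the same function returns #clusters - 1.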

Degrees of freedom are intended to show you the dimension of the
space in which your data can vary. With i.i.d. data, a sample of size
n initially varies in n directions in the response variable space, and
if you are estimating a regression with p variables, you use up those
p degrees of freedom and are left with n-p residual degrees of
freedom. In the sampling world, your degrees of freedom are by and
large determined by the first stage of sampling. Even though you might
have thousands of observations, all of the variability of the sample
statistic might be confined to a tiny (in number of dimensions)
subspace of that R^n space. If your characteristic is persistent
within clusters, then the dimension of the space in which a parameter
estimate can vary is pretty much the number of clusters -- hence the
degrees of freedom approximation above. If the characteristic varies a
lot within clusters, the degrees of freedom might be higher.

Korn & Graubard discuss this in a couple of places: in their 1999 book
(http://www.citeulike.org/user/ctacmo/article/553280), as well as in
(http://www.citeulike.org/user/ctacmo/article/933864). Sometimes, you
can increase the degrees of freedom (and that's what Austin was
suggesting in his first email), although that inevitably involves
decisions that are not technically justified.

The design effects for more complicated models such as regression are
also more complicated. It is typically believed that the design
effects are smaller for the coefficient estimates in regression
(see Skinner's chapter in the 1989 book,
http://www.citeulike.org/user/ctacmo/article/716034), since some of
the differences between clusters are accounted for by explanatory
variables. However, there are occasions when the design effects are
not negligible and may grow with the sample sizes
(http://www.citeulike.org/user/ctacmo/article/2862653), so that's not
always clear-cut.

--
Stas Kolenikov, also found at http://stas.kolenikov.name