# Re: st: cluster and F test

From: Steven Samuels
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: cluster and F test
Date: Thu, 17 Jul 2008 09:47:46 -0400

You are welcome, Ángel.

The formula that Stas and I gave is for a simplified theoretical model of cluster sampling. Kish does work more with "roh", but this is a function of the two components of variance. In his book, the similar formulas appropriate for finite-population sampling are 5.6.5 and 5.6.5'. See also section 8.3b for the optimal design with cost functions.

-Steve

On Jul 17, 2008, at 8:58 AM, Ángel Rodríguez Laso wrote:

Thanks to Steven and Stas for their input.

I wasn't aware of the existence of the formula that both of them mention:

var=(s_b^2/m)+(s_w^2/nm)

The one I was using to calculate DEFF due to clustering effects is:

DEFF_c = 1+(n-1)r

where n is the number of individuals per cluster and r is the intraclass
correlation coefficient, as mentioned by Steve. I've found this
formula in different papers, and it is probably also included in Kish
1994, Survey Sampling. New York: Wiley and Sons, Inc.

I'm sure Stata calculates standard errors in a different way, but
it is possible to get a reasonable approximation, as shown in the
example I sent, by dividing the total sample size by the DEFF
(calculated with the above formula) and then using the corrected
sample size in the formula for the standard error assuming an SRS.
I've checked that the procedure is also valid when DEFF is over 1.
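
The approximation described above can be sketched in a few lines of Python. This only illustrates the arithmetic (DEFF from the intraclass correlation, then an SRS standard error on the corrected sample size); it is not what Stata does internally. The function names and the numbers used (20 individuals per cluster, r = 0.05, 1000 observations, p = 0.3) are made up for the example.

```python
import math

def deff_cluster(n_per_cluster, icc):
    """Design effect due to clustering: DEFF_c = 1 + (n - 1) * r."""
    return 1.0 + (n_per_cluster - 1) * icc

def effective_sample_size(total_n, deff):
    """Corrected sample size: total sample size divided by DEFF."""
    return total_n / deff

def se_proportion_srs(p, n):
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical inputs, chosen only for illustration.
deff = deff_cluster(n_per_cluster=20, icc=0.05)
n_eff = effective_sample_size(total_n=1000, deff=deff)
se_adj = se_proportion_srs(0.3, n_eff)   # SE using the corrected sample size
se_srs = se_proportion_srs(0.3, 1000)    # SE ignoring the clustering

print(deff, n_eff, se_adj, se_srs)
```

Under this shortcut the adjusted standard error is simply the SRS standard error multiplied by sqrt(DEFF), so it can only be as good as the value of r plugged into the formula.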

My point was that in this procedure n, r and mn are all taken into
consideration, while surprisingly (at least to me) in regression the
degrees of freedom do not take account of the total sample size,
however large it is. I wonder if the same procedure (calculating a
corrected sample size with DEFF and then using the usual formulas for
standard errors) would also give approximate results for regressions.
I haven't the skills to do it in Stata.

I've read Korn and Graubard's book and they don't give much
explanation of the reason to choose #clusters - #strata as the degrees
of freedom.

Cheers,

2008/7/15, Stas Kolenikov <skolenik@gmail.com>:

On 7/11/08, Ángel Rodríguez Laso <angelrlaso@gmail.com> wrote:
So the standard error is calculated on the effective sample size (16648;
p(1-p)/se*se) that, if corrected by deft*deft becomes
(16648*0.855246*0.855246) 12177, much closer to the number of
observations than to the number of clusters.

That's not how the standard errors and design effects are computed.
The primary computation is the variance of the estimates (using
svy-appropriate estimator, in your case Taylor series linearization,
also coming out of -robust- and -cluster- options in non-svy
commands). Then the design effects are computed as a secondary measure
comparing the resulting standard errors to the "ideal" ones that would
be obtained from an SRS sample.

As Steven said, in cluster designs, there is no single total sample
size that you are using to achieve a given power or precision of your
estimates. For a sample size of 1000, you might have 40 clusters with
25 observations, or 500 clusters with 2 observations, or 10 clusters
with 50 observations plus another 5 clusters with 100 observations.
They will all likely give quite different design effects. I'd
encourage you to think about minimizing the variance Steven gives in
the end of his email with respect to n and m, given the (average)
within and between variances S_w^2 and S_b^2 that you need to know.
Same thinking applies to proportions where observations are 0s and 1s.
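
As a rough illustration of this point, the sketch below evaluates the simplified formula var = s_b^2/m + s_w^2/(nm) for two of the equal-size designs mentioned above, both with 1000 observations in total. The variance components (s_b^2 = 0.1, s_w^2 = 0.9) are hypothetical values picked for the example, and the function name is mine. The third design, with unequal cluster sizes, is not covered by this equal-size formula.

```python
def cluster_mean_variance(s_b2, s_w2, m, n):
    """Variance of the sample mean with m clusters of n observations each,
    under the simplified model: s_b^2 / m + s_w^2 / (n * m)."""
    return s_b2 / m + s_w2 / (n * m)

# Hypothetical between- and within-cluster variance components.
S_B2, S_W2 = 0.1, 0.9

# Two allocations of the same total sample size of 1000.
var_40x25 = cluster_mean_variance(S_B2, S_W2, m=40, n=25)
var_500x2 = cluster_mean_variance(S_B2, S_W2, m=500, n=2)

# Reference variance if the 1000 observations were independent draws.
var_srs = (S_B2 + S_W2) / 1000

print(var_40x25 / var_srs)  # ratio to the SRS variance for 40 clusters of 25
print(var_500x2 / var_srs)  # ratio to the SRS variance for 500 clusters of 2
```

With these made-up components the implied intraclass correlation is 0.1, and the two ratios agree with 1 + (n-1)r for n = 25 and n = 2, which shows how the two formulas in this thread connect under the simplified model.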

Going back to the issue of the degrees of freedom of the design --
this is somewhat of a complicated issue. #clusters - #strata (or
#clusters - 1 for unstratified designs) is the most conservative
approximation to the degrees of freedom of a design.
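
The counting rule itself is simple enough to write down. The sketch below is only an illustration of #clusters - #strata; the function name and the cluster-to-stratum layout are invented for the example and are not part of any Stata command.

```python
def design_df(cluster_to_stratum):
    """Design degrees of freedom: #clusters - #strata.

    cluster_to_stratum maps each first-stage cluster id to its stratum id.
    An unstratified design is treated as a single stratum, which gives
    #clusters - 1.
    """
    n_clusters = len(cluster_to_stratum)
    n_strata = len(set(cluster_to_stratum.values()))
    return n_clusters - n_strata

# Hypothetical design: 40 clusters spread evenly over 4 strata.
stratified = {cluster: cluster % 4 for cluster in range(40)}
# Hypothetical unstratified design: 40 clusters, one stratum.
unstratified = {cluster: 0 for cluster in range(40)}

print(design_df(stratified))    # 40 - 4
print(design_df(unstratified))  # 40 - 1
```

Note that the number of observations inside each cluster does not enter the calculation at all, which is the behaviour Ángel found surprising.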

Degrees of freedom are intended to show you what the dimension of the
space is in which your data can vary. With n independent observations,
the sample initially varies in n directions in the response variable
space, and if you are estimating a regression with p variables, you use
up those p degrees of freedom and are left with n-p residual degrees of
freedom. In the sampling world, your degrees of freedom are by and large
determined by the first stage of sampling. Even though you might have
thousands of observations, all of the variability of the sample
statistic might be confined to a tiny (in number of dimensions)
subspace of that R^n space. If your characteristic is persistent within
clusters, then the dimension of the space in which a parameter estimate
can vary is pretty much the number of clusters -- hence the
#clusters - #strata degrees of freedom. If the characteristic varies a
lot within clusters, the degrees of freedom might be higher.

Korn & Graubard discuss this in a couple of places; in their 1999 book
(http://www.citeulike.org/user/ctacmo/article/553280), as well as in
(http://www.citeulike.org/user/ctacmo/article/933864). Sometimes, you
can increase the degrees of freedom (and that's what Austin was
suggesting in his first email), although that inevitably involves
decisions that are not technically justified.

The design effects for more complicated models such as regression are
also more complicated. It is typically believed that the
design effects are smaller for the coefficient estimates in regression
(see Skinner's chapter in the 1989 book,
http://www.citeulike.org/user/ctacmo/article/716034), since some of
the differences between clusters are accounted for by explanatory
variables. However, there are occasions when the design effects are not
negligible and may grow with the sample size
(http://www.citeulike.org/user/ctacmo/article/2862653), so it's not
always clear-cut.

--
Stas Kolenikov, also found at http://stas.kolenikov.name

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
