# Re: st: cluster and F test

 From "Ángel Rodríguez Laso" To statalist@hsphsun2.harvard.edu Subject Re: st: cluster and F test Date Thu, 17 Jul 2008 14:58:41 +0200

```
Thanks to Steven and Stas for their input.

I wasn't aware of the existence of the formula that both of them mention:

var = (s_b^2/m) + (s_w^2/(n*m))

The one I was using to calculate DEFF due to clustering effects is:

DEFF_c = 1+(n-1)r

where n is the number of individuals per cluster and r is the intraclass
correlation coefficient, as mentioned by Steve. I've found this
formula in several papers, and it is probably also included in Kish
1994, Survey Sampling. New York: Wiley and Sons, Inc.

I'm sure Stata calculates standard errors in a different way, but
it is possible to get a reasonable approximation, as shown in the
example I sent, by dividing the total sample size by the DEFF
(calculated with the above formula) and then using the corrected
sample size in the formula for the standard error assuming an SRS.
I've checked that the procedure is also valid when DEFF is over 1.

My point was that in this procedure n, r and mn are all taken into
consideration, while surprisingly (at least for me) in regression the
degrees of freedom do not take account of the total sample size,
however large it is. I wonder whether the same procedure (calculating a
corrected sample size with DEFF and then using the usual formulas for
standard errors) would also give approximate results for regressions.
I don't have the skills to do it in Stata.

I've read Korn and Graubard's book and they don't give much
explanation of the reason for choosing #clusters - #strata as the degrees
of freedom.

Cheers,

2008/7/15, Stas Kolenikov <skolenik@gmail.com>:
> On 7/11/08, Ángel Rodríguez Laso <angelrlaso@gmail.com> wrote:
> >  So the standard error is calculated on the effective sample size (16648;
> >  p(1-p)/se*se) that, if corrected by deft*deft becomes
> >  (16648*0.855246*0.855246) 12177, much closer to the number of
> >  observations than to the number of clusters.
>
> That's not how the standard errors and design effects are computed.
> The primary computation is the variance of the estimates (using
> svy-appropriate estimator, in your case Taylor series linearization,
> also coming out of -robust- and -cluster- options in non-svy
> commands). Then the design effects are computed as a secondary measure
> comparing the resulting standard errors to the "ideal" ones that would
> be obtained from an SRS sample.
>
> As Steven said, in cluster designs, there is no single total sample
> size that you are using to achieve a given power or precision of your
> estimates. For a sample size of 1000, you might have 40 clusters with
> 25 observations, or 500 clusters with 2 observations, or 10 clusters
> with 50 observations plus another 5 clusters with 100 observations.
> They will all likely give quite different design effects. I'd
> encourage you to think about minimizing the variance Steven gives in
> the end of his email with respect to n and m, given the (average)
> within and between variances S_w^2 and S_b^2 that you need to know.
> Same thinking applies to proportions where observations are 0s and 1s.
>
> Going back to the issue of the degrees of freedom of the design --
> this is somewhat of a complicated issue.  #clusters - #strata (or
> #clusters - 1 for unstratified designs) is the most conservative
> approximation to the degrees of freedom of a design.
>
> Degrees of freedom are intended to show you what the dimension of the
> space where your estimates live is. In i.i.d. settings, your sample
> initially varies in n directions in the response variable space, and
> if you are estimating a regression with p variables, you use up those
> p degrees of freedom and are left with n-p residual degrees of freedom. In
> the sampling world, your degrees of freedom are by and large determined by
> the first stage of sampling. Even though you might have thousands of
> observations, all of the variability of the sample statistic might be
> confined to a tiny (in number of dimensions) subspace of that R^n
> space. If your characteristic is persistent within clusters, then the
> dimension of the space in which a parameter estimate can vary is
> pretty much the number of clusters -- hence the degrees of freedom
> of the design. If instead there is a lot of variability within
> clusters, similar to the spread in the population, then your effective
> degrees of freedom might be higher.
>
> Korn & Graubard discuss this in a couple of places; in their 1999 book
> (http://www.citeulike.org/user/ctacmo/article/553280), as well as in
> earlier 1995 JRSS A paper
> (http://www.citeulike.org/user/ctacmo/article/933864). Sometimes, you
> can increase the degrees of freedom (and that's what Austin was
> suggesting in his first email), although that inevitably involves
> decisions that are not technically justified.
>
> The design effects for more complicated models such as regression are
> also getting more complicated. It is typically believed that the
> design effects are smaller for the coefficient estimates in regression
> (see Skinner's chapter in 1989 book,
> http://www.citeulike.org/user/ctacmo/article/716034), since some of
> the differences between clusters are accounted for by explanatory
> variables. However, there are occasions when the design effects are not
> negligible and may grow with the sample sizes
> (http://www.citeulike.org/user/ctacmo/article/2862653), so that's not
> always clear-cut.
>
> --
> Stas Kolenikov, also found at http://stas.kolenikov.name
>
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

```
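The two formulas discussed in the message can be tied together with a little arithmetic. The following is only a rough sketch in Python (not Stata, and not how Stata computes its standard errors), under assumptions that are not in the thread: equal cluster sizes, and s_b^2 and s_w^2 read as between- and within-cluster variance components, so that r = s_b^2 / (s_b^2 + s_w^2). Under those assumptions, s_b^2/m + s_w^2/(n*m) equals the SRS variance s^2/(n*m) multiplied by 1 + (n-1)r. The function names and the values r = 0.05 and s^2 = 1 are hypothetical, chosen for illustration; the cluster layouts are two of those Stas mentions.

```python
import math


def deff_cluster(n_per_cluster: float, icc: float) -> float:
    """Design effect due to clustering: DEFF_c = 1 + (n - 1) * r."""
    return 1.0 + (n_per_cluster - 1.0) * icc


def var_mean_two_stage(s_b2: float, s_w2: float, m: int, n: int) -> float:
    """Variance of the mean with m clusters of n units: s_b^2/m + s_w^2/(n*m)."""
    return s_b2 / m + s_w2 / (n * m)


def se_proportion_corrected(p: float, total_n: float, deff: float) -> float:
    """SRS standard error of a proportion, using the corrected size N / DEFF."""
    n_eff = total_n / deff  # "corrected" (effective) sample size
    return math.sqrt(p * (1.0 - p) / n_eff)


if __name__ == "__main__":
    icc = 0.05  # hypothetical intraclass correlation (assumption)
    s2 = 1.0    # hypothetical total variance (assumption)
    # Split the total variance into between and within components via r.
    s_b2, s_w2 = icc * s2, (1.0 - icc) * s2

    # Two layouts with the same total sample size of 1000.
    for m, n in [(40, 25), (500, 2)]:
        deff = deff_cluster(n, icc)
        direct = var_mean_two_stage(s_b2, s_w2, m, n)
        via_deff = (s2 / (n * m)) * deff  # SRS variance times DEFF
        print(f"m={m} n={n} deff={deff:.3f} n_eff={m * n / deff:.1f} "
              f"var_direct={direct:.6f} var_via_deff={via_deff:.6f}")
```

With r = 0.05, the 40 x 25 layout has a DEFF of about 2.2 and the 500 x 2 layout about 1.05, so the same 1000 observations carry very different amounts of information. This says nothing about the degrees-of-freedom question for regression, which depends on the number of clusters rather than on the corrected sample size.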