# Re: st: cluster and F test

 From "Stas Kolenikov" To statalist@hsphsun2.harvard.edu Subject Re: st: cluster and F test Date Thu, 17 Jul 2008 10:19:18 -0500

```On 7/17/08, Ángel Rodríguez Laso <angelrlaso@gmail.com> wrote:
> Thanks to Steven and Stas for their input.
>
>  I wasn't aware of the existence of the formula that both of them mention:
>
>  var=(s_b^2/m)+(s_w^2/nm)
>
>  The one I was using to calculate DEFF due to clustering effects is:
>
>  DEFF_c = 1+(n-1)r
>
>  where n is number of individuals per cluster and r is the intraclass
>  correlation coefficient, as mentioned by Steve. I've found this
>  formula in different papers and it is probably also included in Kish
>  1994, Survey Sampling. New York: Wiley and Sons, Inc.

Yes, I think it is attributed to Kish and dates back to the first
edition of his book (mid 60s, as far as I remember). Again, this is a
very particular situation -- I gave a reference for regression, you
might want to look it up.

>  My point was that in this procedure, all n, r and mn are taken into
>  consideration while surprisingly (at least for me) in regression,
>  degrees of freedom do not take account of the total sample size,
>  however big it is. I wonder if the same procedure (calculating a
>  corrected sample size with deff and then using usual formulas for
>  standard errors) would also give approximate results for regressions.
>  I don't have the skills to do it in Stata.

You don't have to: Stata will take care of all the variance estimation
needs if you specify -svyset- properly. The concept of DEFF can be
generalized to more complex models like regression, but there the
DEFFs are the eigenvalues of the matrix (Variance under actual design)
times (Variance under SRS)^{-1}, so it is not quite clear what design
effect is to be attributed to each variable, etc.

Linear models with i.i.d. errors develop all sorts of intuitions, some
good, some bad. The degrees of freedom that come out of those,
sample size - #parameters, are an example where the intuition is not
that great. The generalization of the degrees of freedom concept to
other models is based on the idea of how small perturbations in the
random components of the model are transmitted to the variability of
estimates and predictions. This can be formalized, for instance, by
the matrix of the partial derivatives of the predictions with respect
to y's in the model, or by covariances of predictions with y's, and in
a bunch of similar ways. In linear models, the random components are
the error terms, and the relevant variability in predictions should be
traced from variability in those epsilons. That partial derivatives
matrix can be computed analytically: it is the hat matrix, H =
X(X'X)^{-1}X', of rank p; its complement I - H, which maps the y's to
the residuals, has rank precisely n-p, sample size minus number of
parameters. The idea can be generalized to
nonlinear models and to data mining with flexible models like splines
or trees or mixtures or other heavily data dependent models, where
each parameter costs not one, but about three degrees of freedom
(http://www.citeulike.org/user/ctacmo/article/574999).

In survey sampling, the measurements are assumed to be fixed
quantities, and the randomness is in the observations that are chosen
for a particular sample (which is described in design based inference
paradigm by random 0/1 inclusion indicators;
http://www.citeulike.org/user/ctacmo/article/1036973). Once you try to
trace the effect of that randomness on the estimation results, you get
the picture that I tried to describe geometrically. A small perturbation
in the random components should be thought of as the replacement of an
observation in the sample by another observation "close" in the
sampling scheme. In other words, an observation from a cluster can be
replaced by another observation in that same cluster. When you do
that, you probably won't see much change in the results if
observations within clusters are positively correlated. Moreover,
observations in other strata may not be affected at all, which
dramatically reduces the rank of that partial derivatives matrix that
I mentioned. When you analyze this more formally, you do see that the
(lower bound on) degrees of freedom comes out to be #clusters -
#strata. It is a lower bound, so in practical applications it might be
too conservative, but on the other hand it can be attained with some
populations and some designs.

>  I've read Korn and Graubard's book and they don't give much
>  explanation of the reason to choose #clusters - #strata as the degrees
>  of freedom.

I tried to explain the geometric considerations for the #clusters, and
#strata is simply the number of means you need to estimate, one per
stratum. Degrees of freedom are extensively discussed in Sec.
5.2.

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: Please do not reply to my Gmail address as I don't check
it regularly.

```
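
The quantities discussed in the message can be checked numerically. What follows is a minimal sketch in Python with NumPy, not Stata's own computation: the function names `kish_deff`, `hat_matrix_ranks`, and `design_df`, as well as the toy data, are hypothetical and introduced only for illustration. It assumes equal cluster sizes for the Kish formula, a design matrix of full column rank, and cluster identifiers that denote primary sampling units nested within strata.

```python
import numpy as np


def kish_deff(n_per_cluster, icc):
    """Kish design effect for clusters of equal size: 1 + (n - 1) * r."""
    return 1.0 + (n_per_cluster - 1) * icc


def hat_matrix_ranks(X):
    """Return (rank of H, rank of I - H) for H = X (X'X)^{-1} X'.

    X @ pinv(X) equals the hat matrix when X has full column rank.
    """
    n = X.shape[0]
    H = X @ np.linalg.pinv(X)
    rank_H = int(np.linalg.matrix_rank(H))
    rank_resid = int(np.linalg.matrix_rank(np.eye(n) - H))
    return rank_H, rank_resid


def design_df(strata, clusters):
    """Design degrees of freedom: #clusters - #strata.

    Clusters are counted as distinct (stratum, cluster) pairs, so the same
    cluster label may be reused in different strata.
    """
    psus = set(zip(strata, clusters))
    return len(psus) - len(set(strata))


if __name__ == "__main__":
    # Kish DEFF and the implied effective sample size (toy numbers).
    deff = kish_deff(n_per_cluster=10, icc=0.05)
    print("DEFF:", deff, "effective n out of 1000:", 1000 / deff)

    # Linear model with n = 20 observations and p = 3 parameters.
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
    print("rank(H), rank(I - H):", hat_matrix_ranks(X))

    # Two strata with two clusters each, two observations per cluster.
    strata = [1, 1, 1, 1, 2, 2, 2, 2]
    clusters = [1, 1, 2, 2, 1, 1, 2, 2]
    print("design df:", design_df(strata, clusters))
```

The second function makes the contrast in the message concrete: the rank of H is the number of parameters p, while the rank of I - H is the familiar n - p. The third shows that the design-based count depends only on the numbers of clusters and strata, not on the number of observations within them.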