
From: "Stas Kolenikov" <skolenik@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: cluster and F test
Date: Thu, 17 Jul 2008 10:19:18 -0500

On 7/17/08, Ángel Rodríguez Laso <angelrlaso@gmail.com> wrote:
> Thanks to Steven and Stas for their input.
>
> I wasn't aware of the existence of the formula that both of them mention:
>
> var = (s_b^2/m) + (s_w^2/(nm))
>
> The one I was using to calculate DEFF due to clustering effects is:
>
> DEFF_c = 1+(n-1)r
>
> where n is the number of individuals per cluster and r is the intraclass
> correlation coefficient, as mentioned by Steve. I've found this
> formula in different papers and it is probably also included in Kish
> 1994, Survey Sampling. New York: Wiley and Sons, Inc.

Yes, I think it is attributed to Kish and dates back to the first edition of his book (mid 50s, as far as I remember). Again, this is a very particular situation -- I gave a reference for regression, you might want to look it up.

> My point was that in this procedure, all of n, r and mn are taken into
> consideration while surprisingly (at least for me) in regression,
> degrees of freedom do not take account of the total sample size,
> however big it is. I wonder if the same procedure (calculating a
> corrected sample size with deff and then using the usual formulas for
> standard errors) would also give approximate results for regressions.
> I haven't the skills to do it in Stata.

You don't have to: Stata will take care of all the variance estimation needs if you specify the -svyset- properly. The concept of DEFF can be generalized to more complex models like regression, but there the DEFFs are the eigenvalues of the matrix (variance under the actual design) times (variance under SRS)^{-1}. It is not quite clear what design effect is to be attributed to each variable, etc.

Linear models with i.i.d. errors develop all sorts of intuitions; some are good, some are bad. The degrees of freedom that come out of those, sample size - #parameters, are an example where the intuition is not that great.
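
For reference, the two formulas quoted above can be connected in a few lines. This is only a sketch under assumptions that the thread does not spell out: m equal-sized clusters of n observations each, s_b^2 and s_w^2 read as the between- and within-cluster variance components, r defined as s_b^2/(s_b^2+s_w^2), and no finite population correction.

```latex
\begin{align*}
\operatorname{var}(\bar y)
  &= \frac{s_b^2}{m} + \frac{s_w^2}{nm}
   = \frac{n\,s_b^2 + s_w^2}{nm}, \\
\operatorname{var}_{\mathrm{SRS}}(\bar y)
  &= \frac{s_b^2 + s_w^2}{nm}, \\
\mathrm{DEFF}_c
  &= \frac{\operatorname{var}(\bar y)}{\operatorname{var}_{\mathrm{SRS}}(\bar y)}
   = \frac{n\,s_b^2 + s_w^2}{s_b^2 + s_w^2}
   = n r + (1 - r)
   = 1 + (n-1)\,r .
\end{align*}
```

As a made-up numerical illustration, n = 20 and r = 0.05 give DEFF_c = 1 + 19 × 0.05 = 1.95, so m = 50 such clusters (1000 observations) would carry roughly the information of 1000/1.95 ≈ 513 independent observations for estimating a mean.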
The generalization of the degrees of freedom concept to other models is based on the idea of how small perturbations in the random components of the model are transmitted to the variability of estimates and predictions. This can be formalized, for instance, by the matrix of the partial derivatives of the predictions with respect to the y's in the model, or by the covariances of predictions with the y's, and in a bunch of similar ways. In linear models, the random components are the error terms, and the relevant variability in predictions should be traced from the variability in those epsilons. That partial derivatives matrix can be computed analytically: it is the hat matrix, H = X(X'X)^{-1}X', whose rank is p, the number of parameters. Its complement, I - H, maps the y's to the residuals, and the rank of that matrix is precisely n-p, sample size minus number of parameters. The idea can be generalized to nonlinear models and to data mining with flexible models like splines or trees or mixtures or other heavily data-dependent models, where each parameter costs not one, but about three degrees of freedom (http://www.citeulike.org/user/ctacmo/article/574999).

In survey sampling, the measurements are assumed to be fixed quantities, and the randomness is in which observations are chosen for a particular sample (described in the design-based inference paradigm by random 0/1 inclusion indicators; http://www.citeulike.org/user/ctacmo/article/1036973). Once you try to trace the effect of that randomness on the estimation results, you get the picture that I tried to describe geometrically. A small perturbation in the random components should be thought of as the replacement of an observation in the sample by another observation "close" to it in the sampling scheme. In other words, an observation from a cluster can be replaced by another observation in that same cluster. When you do that, you probably won't see much change in the results if observations within clusters are positively correlated.
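
As a side check on the linear-model case above, here is a minimal sketch in plain Python (exact rational arithmetic, no external libraries) showing that H = X(X'X)^{-1}X' has trace p and I - H has trace n - p. For these idempotent projection matrices the trace equals the rank. The design matrix, the covariate values and all helper names are made up for illustration.

```python
from fractions import Fraction


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(a):
    """Gauss-Jordan inverse of a square, nonsingular matrix of Fractions."""
    k = len(a)
    # Augment with the identity matrix: [A | I]
    aug = [list(row) + [Fraction(int(i == j)) for j in range(k)]
           for i, row in enumerate(a)]
    for col in range(k):
        # Find a row with a nonzero entry in this column and swap it up
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        # Eliminate this column from every other row
        for r in range(k):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def hat_matrix(x):
    """H = X (X'X)^{-1} X', the derivative of fitted values w.r.t. y."""
    xt = transpose(x)
    return matmul(matmul(x, inverse(matmul(xt, x))), xt)


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


# Hypothetical design: intercept plus one covariate, so n = 6 and p = 2
x_vals = [1, 2, 3, 5, 8, 13]
X = [[Fraction(1), Fraction(v)] for v in x_vals]
n, p = len(X), len(X[0])

H = hat_matrix(X)
# Residual maker M = I - H
M = [[Fraction(int(i == j)) - H[i][j] for j in range(n)] for i in range(n)]

model_df = trace(H)   # equals p
resid_df = trace(M)   # equals n - p
print(model_df, resid_df)  # prints: 2 4
```

Exact fractions are used here only so that the traces come out as exact integers; with floating point they would match up to rounding error.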
Moreover, observations in other strata may not be affected at all, which dramatically reduces the rank of the partial derivatives matrix that I mentioned. When you analyze this more formally, you do see that the (lower bound on the) degrees of freedom comes out to be #clusters - #strata. It is a lower bound, so in practical applications it might be too conservative, but on the other hand it can be attained with some populations and some designs.

> I've read Korn and Graubard's book and they don't give much
> explanation on the reason to choose #clusters - #strata as the degrees
> of freedom.

I tried to explain the geometric considerations for the #clusters, and #strata is simply the number of means you need to estimate, one per stratum. Degrees of freedom are extensively discussed in Sec. 5.2.

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: Please do not reply to my Gmail address as I don't check it regularly.

