
From: "Stas Kolenikov" <skolenik@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: cluster and F test

Date: Tue, 15 Jul 2008 09:49:28 -0500

On 7/11/08, Ángel Rodríguez Laso <angelrlaso@gmail.com> wrote:
> So the standard error is calculated on the effective sample size (16648;
> p(1-p)/se*se) that, if corrected by deft*deft becomes
> (16648*0.855246*0.855246) 12177, much closer to the number of
> observations than to the number of clusters.

That's not how the standard errors and design effects are computed. The primary computation is the variance of the estimates, using a svy-appropriate estimator (in your case Taylor series linearization, which is also what the -robust- and -cluster- options produce in non-svy commands). The design effects are then computed as a secondary measure that compares the resulting standard errors to the "ideal" ones that would be obtained from an SRS sample.

As Steven said, in cluster designs there is no single total sample size that achieves a given power or precision of your estimates. For a sample size of 1000, you might have 40 clusters with 25 observations, or 500 clusters with 2 observations, or 10 clusters with 50 observations plus another 5 clusters with 100 observations. They will all likely give quite different design effects. I'd encourage you to think about minimizing the variance Steven gives at the end of his email with respect to n and m, given the (average) within and between variances S_w^2 and S_b^2, which you need to know. The same thinking applies to proportions, where the observations are 0s and 1s.

Going back to the degrees of freedom of the design: this is a somewhat complicated issue. #clusters - #strata (or #clusters - 1 for unstratified designs) is the most conservative approximation to the degrees of freedom of a design. Degrees of freedom are intended to show you the dimension of the space in which your estimates live. In i.i.d. settings, your sample initially varies in n directions in the response variable space, and if you are estimating a regression with p variables, you use up those p degrees of freedom and are left with n-p residual degrees of freedom. In the sampling world, your degrees of freedom are by and large determined by the first stage of sampling. Even though you might have thousands of observations, all of the variability of the sample statistic might be confined to a tiny (in number of dimensions) subspace of that R^n space. If your characteristic is persistent within clusters, then the dimension of the space in which a parameter estimate can vary is pretty much the number of clusters, hence the degrees of freedom given above. If your characteristic has sufficient spread within clusters, similar to the spread in the population, then your effective degrees of freedom might be higher. Korn & Graubard discuss this in a couple of places: in their 1999 book (http://www.citeulike.org/user/ctacmo/article/553280), as well as in an earlier 1995 JRSS A paper (http://www.citeulike.org/user/ctacmo/article/933864). Sometimes you can increase the degrees of freedom (and that's what Austin was suggesting in his first email), although that inevitably involves decisions that are not technically justified.

Design effects for more complicated models such as regression are also more complicated. It is typically believed that the design effects are smaller for the coefficient estimates in regression (see Skinner's chapter in the 1989 book, http://www.citeulike.org/user/ctacmo/article/716034), since some of the differences between clusters are accounted for by the explanatory variables. However, there are occasions when the design effects are not negligible and may grow with the sample size (http://www.citeulike.org/user/ctacmo/article/2862653), so that's not always clear-cut.
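To make the earlier 1000-observation example concrete, here is a rough Python sketch (not Stata output, and not something taken from the thread). It assumes the textbook Kish-type approximation deff ≈ 1 + (m_eff - 1)*rho, with m_eff = sum(m_i^2)/sum(m_i) standing in for the cluster size when sizes are unequal, and an arbitrary intraclass correlation rho = 0.05. The function names `approx_deff` and `design_df` are made up for illustration. The design degrees of freedom use the conservative #clusters - #strata rule mentioned above.

```python
# Sketch only: approximate design effects for three ways of spending 1000 observations.
# Assumptions (not from the original post):
#   - Kish-type approximation deff ~= 1 + (m_eff - 1) * rho
#   - m_eff = sum(m_i^2) / sum(m_i) for unequal cluster sizes
#   - rho = 0.05 is an arbitrary illustrative intraclass correlation


def approx_deff(cluster_sizes, rho):
    """Approximate design effect for a one-stratum cluster sample."""
    n = sum(cluster_sizes)
    m_eff = sum(m * m for m in cluster_sizes) / n  # size-weighted mean cluster size
    return 1.0 + (m_eff - 1.0) * rho


def design_df(n_clusters, n_strata=1):
    """Conservative design degrees of freedom: #clusters - #strata."""
    return n_clusters - n_strata


layouts = {
    "40 x 25": [25] * 40,
    "500 x 2": [2] * 500,
    "10 x 50 + 5 x 100": [50] * 10 + [100] * 5,
}

rho = 0.05  # hypothetical value, chosen only for illustration
results = {}
for name, sizes in layouts.items():
    n = sum(sizes)
    deff = approx_deff(sizes, rho)
    results[name] = {
        "n": n,
        "deff": deff,
        "n_eff": n / deff,  # effective sample size under this approximation
        "df": design_df(len(sizes)),
    }
    print(f"{name:>18}: n={n}, deff={deff:.2f}, "
          f"n_eff={n / deff:.0f}, design df={design_df(len(sizes))}")
```

Under these assumptions the three layouts give design effects of roughly 2.2, 1.05 and 4.7, with 39, 499 and 14 design degrees of freedom, even though each has exactly 1000 observations. The real design effects would of course come from the estimated variances, not from this approximation.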
--
Stas Kolenikov, also found at http://stas.kolenikov.name

Small print: Please do not reply to my Gmail address as I don't check it regularly.

**Follow-Ups**:

- **Re: st: cluster and F test** *From:* "Ángel Rodríguez Laso" <angelrlaso@gmail.com>

**References**:

- **Re: st: cluster and F test** *From:* "Austin Nichols" <austinnichols@gmail.com>
- **Re: st: cluster and F test** *From:* sara borelli <saraborelli77@yahoo.it>
- **Re: st: cluster and F test** *From:* "Ángel Rodríguez Laso" <angelrlaso@gmail.com>
- **Re: st: cluster and F test** *From:* Steven Samuels <sjhsamuels@earthlink.net>
- **Re: st: cluster and F test** *From:* "Ángel Rodríguez Laso" <angelrlaso@gmail.com>
