
From: "Ángel Rodríguez Laso" <angelrlaso@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: cluster and F test

Date: Thu, 17 Jul 2008 14:58:41 +0200

Thanks to Steven and Stas for their input. I wasn't aware of the formula that both of them mention:

var = (s_b^2/m) + (s_w^2/(nm))

The one I was using to calculate the DEFF due to clustering effects is:

DEFF_c = 1 + (n-1)r

where n is the number of individuals per cluster and r is the intraclass correlation coefficient, as mentioned by Steve. I've found this formula in different papers, and it is probably also included in Kish 1994, Survey Sampling. New York: Wiley and Sons, Inc.

I'm sure Stata calculates standard errors in a different way, but it is possible to get a reasonable approximation, as shown in the example I sent, by dividing the total sample size by the DEFF (calculated with the above formula) and then using the corrected sample size in the formula for the standard error assuming an SRS. I've checked that the procedure is also valid when the DEFF is over 1.

My point was that this procedure takes n, r and mn all into consideration while, surprisingly (at least for me), in regression the degrees of freedom do not take account of the total sample size, however big it is. I wonder if the same procedure (calculating a corrected sample size with the DEFF and then using the usual formulas for standard errors) would also give approximate results for regressions. I haven't the skills to do it in Stata.

I've read Korn and Graubard's book and they don't give much explanation of the reason for choosing #clusters - #strata as the degrees of freedom.

Cheers,

Ángel

2008/7/15, Stas Kolenikov <skolenik@gmail.com>:
> On 7/11/08, Ángel Rodríguez Laso <angelrlaso@gmail.com> wrote:
> > So the standard error is calculated on the effective sample size (16648;
> > p(1-p)/(se*se)) that, if corrected by deft*deft becomes
> > (16648*0.855246*0.855246) 12177, much closer to the number of
> > observations than to the number of clusters.
>
> That's not how the standard errors and design effects are computed.
> The primary computation is the variance of the estimates (using a
> svy-appropriate estimator, in your case Taylor series linearization,
> also coming out of -robust- and -cluster- options in non-svy
> commands). Then the design effects are computed as a secondary measure
> comparing the resulting standard errors to the "ideal" ones that would
> be obtained from an SRS sample.
>
> As Steven said, in cluster designs, there is no single total sample
> size that you are using to achieve a given power or precision of your
> estimates. For a sample size of 1000, you might have 40 clusters with
> 25 observations, or 500 clusters with 2 observations, or 10 clusters
> with 50 observations plus another 5 clusters with 100 observations.
> They will all likely give quite different design effects. I'd
> encourage you to think about minimizing the variance Steven gives in
> the end of his email with respect to n and m, given the (average)
> within and between variances S_w^2 and S_b^2 that you need to know.
> Same thinking applies to proportions where observations are 0s and 1s.
>
> Going back to the issue of the degrees of freedom of the design --
> this is somewhat of a complicated issue. #clusters - #strata (or
> #clusters - 1 for unstratified designs) is the most conservative
> approximation to the degrees of freedom of a design.
>
> Degrees of freedom are intended to show you what the dimension of the
> space where your estimates live is. In i.i.d. settings, your sample
> initially varies in n directions in the response variable space, and
> if you are estimating a regression with p variables, you use up those
> p degrees of freedom and are left with n-p residual degrees of freedom. In
> the sampling world, your degrees of freedom are by and large determined by
> the first stage of sampling. Even though you might have thousands of
> observations, all of the variability of the sample statistic might be
> confined to a tiny (in number of dimensions) subspace of that R^n
> space. If your characteristic is persistent within clusters, then the
> dimension of the space in which a parameter estimate can vary is
> pretty much the number of clusters -- hence the degrees of freedom
> given above. If your characteristic has sufficient spread within
> clusters, similar to the spread in the population, then your effective
> degrees of freedom might be higher.
>
> Korn & Graubard discuss this in a couple of places; in their 1999 book
> (http://www.citeulike.org/user/ctacmo/article/553280), as well as in
> an earlier 1995 JRSS A paper
> (http://www.citeulike.org/user/ctacmo/article/933864). Sometimes, you
> can increase the degrees of freedom (and that's what Austin was
> suggesting in his first email), although that inevitably involves
> decisions that are not technically justified.
>
> The design effects for more complicated models such as regression are
> also getting more complicated. It is typically believed that the
> design effects are smaller for the coefficient estimates in regression
> (see Skinner's chapter in the 1989 book,
> http://www.citeulike.org/user/ctacmo/article/716034), since some of
> the differences between clusters are accounted for by explanatory
> variables. However, there are occasions when the design effects are not
> negligible and may grow with the sample sizes
> (http://www.citeulike.org/user/ctacmo/article/2862653), so that's not
> always clear cut.
>
> --
> Stas Kolenikov, also found at http://stas.kolenikov.name
> Small print: Please do not reply to my Gmail address as I don't check
> it regularly.
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
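For readers who want to try the arithmetic discussed in the message above, here is a minimal sketch in Python. It is not how Stata computes survey standard errors; it only illustrates the two formulas quoted in the thread, DEFF_c = 1 + (n-1)r and var = s_b^2/m + s_w^2/(nm), together with the "divide the total sample size by the DEFF" approximation. It assumes equal-sized clusters and reads m as the number of clusters and n as the number of observations per cluster. The function names and the example values (an intraclass correlation of 0.05 and a proportion of 0.3) are made up for illustration.

```python
import math


def deff_cluster(n_per_cluster: float, icc: float) -> float:
    """Design effect due to clustering, 1 + (n - 1) * r, for equal-sized clusters."""
    return 1.0 + (n_per_cluster - 1.0) * icc


def effective_sample_size(total_n: float, deff: float) -> float:
    """Total sample size divided by the design effect."""
    return total_n / deff


def se_proportion_srs(p: float, n: float) -> float:
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


def var_cluster_mean(s_b2: float, s_w2: float, m: float, n: float) -> float:
    """Variance s_b^2/m + s_w^2/(n*m): m clusters (assumed), n observations each."""
    return s_b2 / m + s_w2 / (n * m)


# Hypothetical inputs, chosen only to show how the design changes the result.
icc = 0.05  # assumed intraclass correlation
p = 0.3     # assumed proportion

# Two of the 1000-observation designs mentioned in the thread.
for m, n in [(40, 25), (500, 2)]:
    total_n = m * n
    deff = deff_cluster(n, icc)
    n_eff = effective_sample_size(total_n, deff)
    se = se_proportion_srs(p, n_eff)
    print(f"{m} clusters x {n} obs: DEFF={deff:.3f}, "
          f"effective n={n_eff:.1f}, approx SE={se:.4f}")
```

With the same total of 1000 observations, the two designs give different design effects and therefore different approximate standard errors, which is the point made in the quoted reply. The third design in the thread (clusters of unequal size) is left out because the simple formula assumes a common cluster size.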

**Follow-Ups**:

- **Re: st: cluster and F test** *From:* "Stas Kolenikov" <skolenik@gmail.com>
- **Re: st: cluster and F test** *From:* Steven Samuels <sjhsamuels@earthlink.net>

**References**:

- **Re: st: cluster and F test** *From:* "Austin Nichols" <austinnichols@gmail.com>
- **Re: st: cluster and F test** *From:* sara borelli <saraborelli77@yahoo.it>
- **Re: st: cluster and F test** *From:* "Ángel Rodríguez Laso" <angelrlaso@gmail.com>
- **Re: st: cluster and F test** *From:* Steven Samuels <sjhsamuels@earthlink.net>
- **Re: st: cluster and F test** *From:* "Ángel Rodríguez Laso" <angelrlaso@gmail.com>
- **Re: st: cluster and F test** *From:* "Stas Kolenikov" <skolenik@gmail.com>
