From: Steven Samuels <sjhsamuels@earthlink.net>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: cluster and F test
Date: Thu, 17 Jul 2008 09:47:46 -0400

You are welcome, Ángel.

The formula that Stas and I gave is for a simplified theoretical model of cluster sampling. Kish does work more with "roh", but this is a function of the two components of variance. In his book, the similar formulas appropriate for finite-population sampling are 5.6.5 and 5.6.5'. See also section 8.3b for the optimal design with cost functions.

-Steve

On Jul 17, 2008, at 8:58 AM, Ángel Rodríguez Laso wrote:

Thanks to Steven and Stas for their input.

I wasn't aware of the existence of the formula that both of them mention:

var=(s_b^2/m)+(s_w^2/nm)

The one I was using to calculate DEFF due to clustering effects is:

DEFF_c = 1+(n-1)r

where n is number of individuals per cluster and r is the intraclass

correlation coefficient, as mentioned by Steve. I've found this

formula in different papers, and it is probably also included in Kish

1994, Survey Sampling. New York: Wiley and Sons, Inc.
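
The two formulas above are linked. As a sketch, assuming the simplified model with m clusters of n individuals each, total element variance s^2 = s_b^2 + s_w^2, and r = s_b^2/(s_b^2 + s_w^2) (the symbol s^2 and these identities are assumptions of the simplified model, not Kish's finite-population versions), the variance formula factors into the SRS variance times DEFF_c:

```latex
\begin{align*}
\mathrm{var} &= \frac{s_b^2}{m} + \frac{s_w^2}{nm}
  = \frac{n\,s_b^2 + s_w^2}{nm} \\
  &= \frac{s^2}{nm}\,\bigl[\,n r + (1 - r)\,\bigr]
  = \frac{s^2}{nm}\,\bigl[\,1 + (n-1)\,r\,\bigr]
  = \frac{s^2}{nm}\,\mathrm{DEFF}_c .
\end{align*}
```

Here s^2/(nm) is the variance of a mean from an SRS of nm observations, so the ratio of the two variances is exactly 1 + (n-1)r under these assumptions.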

I'm sure Stata calculates standard errors with a different system, but

it is possible to get a reasonable approximation, as shown in the

example I sent, by dividing the total sample size by the DEFF

(calculated with the above formula) and then using the corrected

sample size in the formula for the standard error assuming a SRS.

I've checked that the procedure is also valid when DEFF is over 1.
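
The procedure just described can be sketched in a few lines of Python. This is only an illustration of the approximation, not how Stata computes its standard errors; the function names and the input values (40 clusters of 25, r = 0.05, p = 0.3) are hypothetical and not taken from the thread.

```python
import math

def deff_cluster(n_per_cluster, r):
    """Design effect due to clustering: DEFF_c = 1 + (n - 1) r."""
    return 1.0 + (n_per_cluster - 1) * r

def approx_se_proportion(p, total_n, n_per_cluster, r):
    """SRS standard error of a proportion, with total_n / DEFF_c as the sample size."""
    n_eff = total_n / deff_cluster(n_per_cluster, r)
    return math.sqrt(p * (1.0 - p) / n_eff)

# Hypothetical inputs: 1000 observations in clusters of 25, r = 0.05, p = 0.3
deff = deff_cluster(25, 0.05)
se_clustered = approx_se_proportion(0.3, 1000, 25, 0.05)
se_srs = math.sqrt(0.3 * 0.7 / 1000)

print("DEFF_c:", deff)
print("approximate clustered SE:", se_clustered)
print("SRS SE:", se_srs)
```

The corrected standard error is the SRS standard error multiplied by the square root of DEFF_c, which is why the same recipe works whether DEFF_c is above or below 1.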

My point was that in this procedure, all n, r and mn are taken into

consideration while surprisingly (at least for me) in regression,

degrees of freedom do not take account of the total sample size,

however large it is. I wonder if the same procedure (calculating a

corrected sample size with deff and then using usual formulas for

standard errors) would also give approximate results for regressions.

I haven't the skills to do it in Stata.

I've read Korn and Graubard's book and they don't give much

explanation of the reason to choose #clusters - #strata as the degrees

of freedom.

Cheers,

Ángel

2008/7/15, Stas Kolenikov <skolenik@gmail.com>:

On 7/11/08, Ángel Rodríguez Laso <angelrlaso@gmail.com> wrote:

So the standard error is calculated on the effective sample size (16648;

p(1-p)/se*se) that, if corrected by deft*deft becomes

(16648*0.855246*0.855246) 12177, much closer to the number of

observations than to the number of clusters.

That's not how the standard errors and design effects are computed.

The primary computation is the variance of the estimates (using

svy-appropriate estimator, in your case Taylor series linearization,

also coming out of -robust- and -cluster- options in non-svy

commands). Then the design effects are computed as a secondary measure

comparing the resulting standard errors to the "ideal" ones that would

be obtained from an SRS sample.
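
The arithmetic in the quoted message, and the direction of computation described in this paragraph, can be checked with a short Python sketch. The two numbers are the ones quoted above; the helper name design_effect is hypothetical and only illustrates the definition of DEFF as a ratio of variances.

```python
def design_effect(se_design, se_srs):
    """DEFF as a secondary measure: ratio of the design-based variance to the SRS variance."""
    return (se_design / se_srs) ** 2

# Figures quoted in the message
n_eff = 16648        # effective sample size, p(1-p)/se^2
deft = 0.855246      # reported design factor (square root of DEFF)

deff = deft ** 2
n_corrected = n_eff * deff

print("DEFF:", round(deff, 4))
print("effective sample size times DEFF:", round(n_corrected))
```

The product reproduces the 12177 in the quoted message. Because deft is below 1 here, the effective sample size exceeds the corrected figure.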

As Steven said, in cluster designs, there is no single total sample

size that you are using to achieve a given power or precision of your

estimates. For a sample size of 1000, you might have 40 clusters with

25 observations, or 500 clusters with 2 observations, or 10 clusters

with 50 observations plus another 5 clusters with 100 observations.

They will all likely give quite different design effects. I'd

encourage you to think about minimizing the variance Steven gives at

the end of his email with respect to n and m, given the (average)

within and between variances S_w^2 and S_b^2 that you need to know.

Same thinking applies to proportions where observations are 0s and 1s.
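
As a rough illustration of how the split of 1000 observations changes the variance, here is a Python sketch of the formula var = s_b^2/m + s_w^2/(nm). The variance components (0.1 between, 0.9 within) are hypothetical, chosen only for the example. The formula assumes equal cluster sizes, so the third design mentioned above (10 clusters of 50 plus 5 clusters of 100) is not covered by it.

```python
def var_cluster_mean(s_b2, s_w2, m, n):
    """var = s_b^2/m + s_w^2/(n*m) for m clusters of n observations each."""
    return s_b2 / m + s_w2 / (n * m)

# Hypothetical variance components (assumed, not from the thread)
s_b2, s_w2 = 0.1, 0.9

# Two of the equal-size designs mentioned above, both with 1000 observations
designs = {"40 clusters x 25": (40, 25), "500 clusters x 2": (500, 2)}

# Variance of the mean under an SRS of 1000 with the same total variance
srs_var = (s_b2 + s_w2) / 1000

results = {}
for label, (m, n) in designs.items():
    v = var_cluster_mean(s_b2, s_w2, m, n)
    results[label] = (v, v / srs_var)
    print(label, "variance:", v, "design effect:", v / srs_var)
```

With these assumed components the same total sample size gives a design effect of about 3.4 for 40 clusters of 25 and about 1.1 for 500 clusters of 2.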

Going back to the issue of the degrees of freedom of the design --

this is somewhat of a complicated issue. #clusters - #strata (or

#clusters - 1 for unstratified designs) is the most conservative

approximation to the degrees of freedom of a design.

Degrees of freedom are intended to show you what the dimension of the

space where your estimates live is. In i.i.d. settings, your sample

initially varies in n directions in the response variable space, and

if you are estimating a regression with p variables, you use up those

p degrees of freedom and are left with n-p residual degrees of freedom. In

the sampling world, your degrees of freedom are by and large determined by

the first stage of sampling. Even though you might have thousands of

observations, all of the variability of the sample statistic might be

confined to a tiny (in number of dimensions) subspace of that R^n

space. If your characteristic is persistent within clusters, then the

dimension of the space in which a parameter estimate can vary is

pretty much the number of clusters -- hence the degrees of freedom

given above. If your characteristic has sufficient spread within

clusters, similar to the spread in the population, then your effective

degrees of freedom might be higher.
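
The contrast between the two counts can be made concrete with a small Python sketch. The function names are hypothetical, and the numbers (1000 observations, 5 regression parameters, 40 clusters) are made up for illustration.

```python
def design_df(n_clusters, n_strata=1):
    """Conservative design degrees of freedom: #clusters - #strata."""
    if n_clusters <= n_strata:
        raise ValueError("need more clusters than strata")
    return n_clusters - n_strata

def iid_residual_df(n_obs, n_params):
    """Residual degrees of freedom in the i.i.d. setting: n - p."""
    return n_obs - n_params

# Hypothetical survey: 1000 observations, 5 regression parameters
print("i.i.d. residual df:", iid_residual_df(1000, 5))
print("design df, 40 clusters, unstratified:", design_df(40))
print("design df, 40 clusters in 4 strata:", design_df(40, n_strata=4))
```

The design-based count ignores the number of observations entirely, which is the behaviour Ángel found surprising.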

Korn & Graubard discuss this in a couple of places: in their 1999 book

(http://www.citeulike.org/user/ctacmo/article/553280), as well as in

an earlier 1995 JRSS A paper

(http://www.citeulike.org/user/ctacmo/article/933864). Sometimes, you

can increase the degrees of freedom (and that's what Austin was

suggesting in his first email), although that inevitably involves

decisions that are not technically justified.

The design effects for more complicated models such as regression are

also more complicated. It is typically believed that the

design effects are smaller for the coefficient estimates in regression

(see Skinner's chapter in the 1989 book,

http://www.citeulike.org/user/ctacmo/article/716034), since some of

the differences between clusters are accounted for by explanatory

variables. However, there are occasions when the design effects are not

negligible and may grow with the sample sizes

(http://www.citeulike.org/user/ctacmo/article/2862653), so that's not

always clear-cut.

--

Stas Kolenikov, also found at http://stas.kolenikov.name

Small print: Please do not reply to my Gmail address as I don't check

it regularly.

*

* For searches and help try:

* http://www.stata.com/support/faqs/res/findit.html

* http://www.stata.com/support/statalist/faq

* http://www.ats.ucla.edu/stat/stata/
