
From: Steven Samuels <sjhsamuels@earthlink.net>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: cluster and F test
Date: Mon, 14 Jul 2008 17:26:53 -0400

Ángel:

On Jul 11, 2008, at 4:19 AM, Ángel Rodríguez Laso wrote:

> Dear Steven, From my readings I've understood that the design effect comprises all loss of precision due to clustering and weighting. Once the sample size is corrected by the design effect, what matters is the number of observations. These are the results for the proportion of a variable in a complex design survey:

In your example, the DEFT is <1, indicating that the cluster sample is more precise than an SRS. This would happen, for example, if in every cluster the proportion of "sí" is about 10%, the population proportion. Essentially, the "between-cluster" SD would be zero in the formula I previously presented. In such a case, the total sample size matters, not the number of clusters.

In the general case, however, the between-cluster SD is not zero. This would happen if the trait you were studying were unevenly distributed among clusters. The most extreme case: if all subjects in 127 clusters had "sí" and all subjects in the remaining 1,139 clusters had "no", then the effective sample size would be the number of clusters. In the absence of stratification contributions to the design effect, the approximate value of DEFF would be "n", where "n" is the average cluster size.
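One way to see why DEFF approaches the average cluster size in that extreme case is Kish's textbook approximation for equal-sized clusters, DEFF ≈ 1 + (n - 1)*rho, where rho is the intra-class correlation. That formula is not stated in this message, so treat it as an assumption. The function names below are illustrative only, and the sketch ignores weights, strata and the FPC.

```python
def kish_deff(n_per_cluster, rho):
    """Approximate design effect for equal-sized clusters (Kish's approximation)."""
    return 1.0 + (n_per_cluster - 1.0) * rho


def effective_sample_size(total_obs, deff):
    """Number of SRS observations giving the same precision as the cluster sample."""
    return total_obs / deff


# Counts taken from the survey output in this thread.
total_obs = 12174
clusters = 1266
avg_n = total_obs / clusters  # average cluster size

# rho = 1: everyone within a cluster gives the same answer (the extreme case).
deff_extreme = kish_deff(avg_n, rho=1.0)
# rho = 0: each cluster looks like a small random sample of the population.
deff_none = kish_deff(avg_n, rho=0.0)

print("rho=1:", deff_extreme, effective_sample_size(total_obs, deff_extreme))
print("rho=0:", deff_none, effective_sample_size(total_obs, deff_none))
```

With rho = 1 the design effect equals the average cluster size and the effective sample size collapses to the number of clusters. With rho = 0 the design effect is 1 and every observation counts fully.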

> What is surprising to me is that in regression in this context, only the number of clusters counts and not the number of individuals per cluster (or the total number of individuals), as Austin has said. That amounts to saying that having 1000 observations per cluster would yield the same precision as having 1.

You are misinterpreting Austin's statement (I could not find the one you mean). Of course, the number of observations per cluster matters, but only up to a point. The approximate formula for the variance of a mean that I gave previously was:

var = [(s_b)^2]/m + [(s_w)^2]/(nm),

where m = number of clusters and n = number of observations per cluster.

You can see that increasing n does decrease the variance, but this decrease affects only the 2nd term. On the list we occasionally see examples where investigators took a small number of clusters and a huge sample size in some of them, and then were surprised at the big standard errors. For more details, find the formulas for the design effect and for choosing the sample size for clusters in one of the texts I referred to.
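A small numerical sketch of that formula may help. The values of s_b, s_w and m below are hypothetical, chosen only to show the pattern. With m held fixed, the variance falls as n grows but can never drop below the first term, (s_b)^2/m.

```python
def variance_of_mean(s_b, s_w, m, n):
    """Approximate variance of the sample mean: s_b^2/m + s_w^2/(n*m)."""
    return s_b**2 / m + s_w**2 / (n * m)


# Hypothetical values, not taken from the survey in this thread.
s_b, s_w, m = 1.0, 3.0, 20

# Lower bound on the variance when only n is increased.
floor = s_b**2 / m

for n in (1, 10, 100, 1000):
    v = variance_of_mean(s_b, s_w, m, n)
    print(f"n={n:5d}  var={v:.5f}  floor={floor:.5f}")

# Doubling the number of clusters versus taking ten times as many
# observations in each of the original clusters.
print("m=40, n=10 :", variance_of_mean(s_b, s_w, 40, 10))
print("m=20, n=100:", variance_of_mean(s_b, s_w, 20, 100))
```

In this made-up setting, doubling the clusters reduces the variance more than a tenfold increase in observations per cluster. That is the pattern behind the large standard errors mentioned above.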

(Aside: In your example DEFF = -5863. This is a number that should be positive! According to the Stata manual, the value for DEFF is valid only if original population weights are used. In your example the weights are scaled to total the sample size, not the population size, and this may have caused the wild value.)

-Steve

svyset psu [pweight=pesodef2007], strata(areasalud) fpc(secperarea)

      pweight: pesodef2007
          VCE: linearized
     Strata 1: areasalud
         SU 1: psu
        FPC 1: secperarea

. svy: prop p45
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =      11          Number of obs    =    12174
Number of PSUs   =    1266          Population size  =  12172,5
                                    Design df        =     1255

--------------------------------------------------------------
             |             Linearized         Binomial Wald
             | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
p45          |
          sí |   ,0994565   ,0023199      ,0949052    ,1040077
          no |   ,9005435   ,0023199      ,8959923    ,9050948
--------------------------------------------------------------

. estat effects

----------------------------------------------------------
             |             Linearized
             | Proportion   Std. Err.       Deff       Deft
-------------+--------------------------------------------
p45          |
          sí |   ,0994565   ,0023199       -5863    ,855246
          no |   ,9005435   ,0023199       -5863    ,855246
----------------------------------------------------------
Note: Weights must represent population totals for deff to be correct
      when using an FPC; however, deft is invariant to the scale of weights.

end of do-file

So the standard error is calculated on the effective sample size (16648; p(1-p)/(se*se)) that, if corrected by deft*deft, becomes (16648*0.855246*0.855246) 12177, much closer to the number of observations than to the number of clusters. That's the reason why I comment that, for precision, the sample size is a very important determinant. In fact, there is no disagreement between the two points of view, because the total sample size is determined by the number of clusters and the number of observations per cluster.
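The arithmetic in this paragraph can be retraced with a short sketch that uses the estimates printed in the output above, with decimal commas written as points. The variable names are illustrative. The results should land close to the 16648 and 12177 quoted, and small differences are expected, for instance from rounding in the displayed estimates.

```python
# Estimates copied from the svy: prop / estat effects output in this thread.
p = 0.0994565      # estimated proportion of "sí"
se = 0.0023199     # linearized standard error
deft = 0.855246    # design factor reported by estat effects
n_obs = 12174      # number of observations
n_psu = 1266       # number of clusters (PSUs)

# Sample size an SRS would need to reach this standard error: p(1-p)/se^2.
n_implied = p * (1 - p) / se**2

# Scaling by deft^2 brings the figure back near the actual sample size.
n_corrected = n_implied * deft**2

print("implied SRS sample size:", round(n_implied))
print("after multiplying by deft^2:", round(n_corrected))
```

Because DEFT is below 1 here, the implied SRS sample size is larger than the actual number of observations, which matches Steve's reading that this cluster sample is more precise than an SRS.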

What is surprising to me is that in regression in this context, only the number of clusters counts and not the number of individuals per cluster (or the total number of individuals), as Austin has said. That amounts to saying that having 1000 observations per cluster would yield the same precision as having 1.

Cheers,

Ángel

2008/7/8, Steven Samuels <sjhsamuels@earthlink.net>:

Ángel, the primary determinant of precision is the number of clusters, and degrees of freedom are based on these.

To compute the sample size needed in a cluster sample, you need to estimate the number of clusters needed *and* the number of observations per cluster.

Consider an extreme case: everybody in a cluster has the same value of an outcome "Y", but the means differ between clusters. Here one observation will completely represent the cluster and only the number of clusters matters. At the other extreme, if each cluster is a miniature of the original population and the clusters are very similar, then relatively few clusters are needed and more observations can be taken per cluster.

In practice, the actual choice of clusters/observations per cluster is made on the basis of the budget, on the relative costs of adding a cluster and of adding an additional observation within a cluster, and on the ratios of the SDs for the main outcomes between and within clusters. As there are usually several outcomes, a compromise sample size is chosen. See: Sharon Lohr, Sampling: Design and Analysis, Duxbury, 1999, Chapter 5; WG Cochran, Sampling Techniques, Wiley, 1977; L Kish, Survey Sampling, Wiley, 1965. There are many internet references.
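To make the cost trade-off concrete, here is a minimal sketch. It assumes a linear cost function (a fixed cost per cluster plus a cost per observation) and the simple variance formula var = (s_b)^2/m + (s_w)^2/(nm), where s_b and s_w are the between- and within-cluster SDs. Under those assumptions the variance-minimizing cluster size is n = sqrt(cost per cluster / cost per observation) * (s_w/s_b). The budget, costs and SDs below are hypothetical, and the textbook finite-population formulas are more detailed.

```python
import math


def variance_of_mean(s_b, s_w, m, n):
    """Simple nested-model variance of the sample mean."""
    return s_b**2 / m + s_w**2 / (n * m)


def clusters_affordable(budget, cost_cluster, cost_obs, n):
    """Clusters that fit the budget with n observations each (kept continuous)."""
    return budget / (cost_cluster + cost_obs * n)


def optimal_n(cost_cluster, cost_obs, s_b, s_w):
    """Cluster size minimizing the variance for a fixed budget (model-based)."""
    return math.sqrt(cost_cluster / cost_obs) * (s_w / s_b)


# Hypothetical inputs, for illustration only.
budget, cost_cluster, cost_obs = 10_000.0, 100.0, 4.0
s_b, s_w = 1.0, 3.0

n_opt = optimal_n(cost_cluster, cost_obs, s_b, s_w)

# Compare the optimum with a smaller and a larger cluster size, same budget.
for n in (5, n_opt, 30):
    m = clusters_affordable(budget, cost_cluster, cost_obs, n)
    v = variance_of_mean(s_b, s_w, m, n)
    print(f"n={n:5.1f}  m={m:6.1f}  var={v:.4f}")
```

Expensive clusters and large within-cluster variation both push toward more observations per cluster. Large between-cluster variation pushes toward more clusters.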

Key concepts: the intra-class correlation, which measures how similar observations in the same cluster are compared to observations in different clusters; the "design effect", which shows how the standard error of a complex cluster sample is inflated compared to a simple random sample of the same number of observations. Joanne Garrett's program -sampclus- (findit sampclus) requires the investigator to input the correlation. It is most easily calculated by a variance components analysis of similar data.

A *theoretical* nested model can make some concepts clearer (Lohr). Suppose there are observations Y_ij = c + a_i + e_ij. There are m random effects a_i from a distribution with between-cluster SD s_b and, for each a_i, there are n e_ij's drawn from a distribution with "within-cluster" SD s_w. The a's and e's are independent. The total sample size is nm, and the variance of the sample mean is:

var = [(s_b)^2]/m + [(s_w)^2]/(nm).

You can see that, holding m fixed, increasing the number of observations per cluster decreases only the 2nd term.
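For readers who want the intermediate step, the variance formula follows directly from the model once the sample mean is written out. This is a sketch under the assumptions stated in the message (independent a's and e's) plus equal cluster sizes n:

```latex
\begin{aligned}
\bar{Y} &= \frac{1}{nm}\sum_{i=1}^{m}\sum_{j=1}^{n} Y_{ij}
         = c + \frac{1}{m}\sum_{i=1}^{m} a_i
             + \frac{1}{nm}\sum_{i=1}^{m}\sum_{j=1}^{n} e_{ij}, \\[4pt]
\operatorname{Var}(\bar{Y})
        &= \frac{1}{m^{2}}\, m\, s_b^{2} + \frac{1}{(nm)^{2}}\, nm\, s_w^{2}
         = \frac{s_b^{2}}{m} + \frac{s_w^{2}}{nm}.
\end{aligned}
```

Each a_i is shared by all n observations in its cluster, so it is averaged over only m terms, while the e_ij are averaged over all nm terms. That is why extra observations within a cluster shrink only the second term.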

The actual formulas for sampling from finite populations are more complicated, but the same principles apply.

-Steve

On Jul 8, 2008, at 5:07 AM, Ángel Rodríguez Laso wrote:

Following the discussion, I don't understand very well how degrees of freedom (number of clusters minus number of strata) and the actual number of observations are used in svy commands (which are related to cluster regression). I say so because when I calculate the sample size needed in a survey to get a proportion with a determined confidence level, the number I get is the number of observations and not the number of degrees of freedom. So I assume that the number of observations is what conditions the standard error, and then I don't know what degrees of freedom are used for.

Cheers,

Ángel Rodríguez

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

