Re: st: cluster and F test
Angel, the primary determinant of precision is the number of  
clusters, and degrees of freedom are based on these.
To compute the sample size needed in a cluster sample, you need to  
estimate the number of clusters needed *and* the number of  
observations per cluster. Consider an extreme case: everybody in a  
cluster has the same value of an outcome "Y", but the means differ  
between clusters. Here one observation will completely represent the  
cluster and only the number of clusters matters. At the other  
extreme, if each cluster is a miniature of the original population  
and clusters are very similar, then relatively few clusters are needed
and more observations can be taken per cluster.
In practice, the actual choice of clusters/observations per cluster  
is made on the basis of the budget, on the relative costs of adding a  
cluster and of adding an additional observation within a cluster, and  
the ratios of the between-cluster to within-cluster SDs for the main
outcomes. As there are usually several outcomes, a compromise sample
size is chosen. See: Sharon Lohr, Sampling: Design and Analysis,  
Duxbury, 1999, Chapter 5; WG Cochran, Sampling Techniques, Wiley,  
1977; L Kish, Survey Sampling, Wiley, 1965. There are many internet  
references.
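
To make the cost trade-off concrete, here is a rough Python sketch. It is not taken from any of the books above verbatim; it assumes a simple linear cost model (a fixed cost per cluster plus a cost per observation) and a variance of the sample mean of (s_b^2)/m + (s_w^2)/(nm), with m clusters and n observations per cluster. The function name, the cost figures and the SDs are all made up for illustration.

```python
import math


def optimal_allocation(budget, cost_cluster, cost_obs, s_b, s_w):
    """Return (n, m): observations per cluster and number of clusters.

    Assumptions (illustrative only):
      total cost        = m * (cost_cluster + cost_obs * n)
      variance of mean  = s_b**2 / m + s_w**2 / (n * m)
    Minimising the variance for a fixed budget gives
      n_opt = sqrt((cost_cluster / cost_obs) * (s_w**2 / s_b**2)).
    Requires s_b > 0 and cost_obs > 0.
    """
    n_opt = math.sqrt((cost_cluster / cost_obs) * (s_w ** 2 / s_b ** 2))
    n = max(1, round(n_opt))  # at least one observation per cluster
    # Spend the budget on as many whole clusters of size n as it allows.
    m = int(budget // (cost_cluster + cost_obs * n))
    return n, m


# Hypothetical inputs: adding a cluster costs 25 times as much as adding
# an observation, and the within-cluster SD is 3 times the between-cluster SD.
n, m = optimal_allocation(budget=10000, cost_cluster=100, cost_obs=4,
                          s_b=2.0, s_w=6.0)
print("observations per cluster:", n, "clusters:", m)
```

Note that the per-cluster sample size depends only on the cost ratio and the SD ratio; the budget then fixes how many clusters can be afforded. With several outcomes, each will give a different n, which is why a compromise is needed.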
Key concepts: the intra-class correlation, which measures how similar
observations in the same cluster are compared to observations in
different clusters; and the "design effect", which shows how much the
variance of an estimate from a complex cluster sample is inflated
compared to a simple random sample of the same number of observations
(the standard error is inflated by its square root). Joanne Garrett's
program -sampclus- (findit sampclus) requires the investigator to
input the intra-class correlation. It is most easily calculated by a
variance components analysis of similar data.
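
As a rough numerical illustration of these two concepts, the Python sketch below uses the common approximation for equal-sized clusters, design effect = 1 + (n - 1)*rho, where n is the number of observations per cluster and rho is the intra-class correlation. This is not what -sampclus- does internally (I have not reproduced its code); the function names and the numbers are hypothetical.

```python
def icc_from_components(var_between, var_within):
    """Intra-class correlation from variance components."""
    return var_between / (var_between + var_within)


def design_effect(n_per_cluster, icc):
    """Approximate design effect for equal-sized clusters."""
    return 1.0 + (n_per_cluster - 1) * icc


def effective_sample_size(m_clusters, n_per_cluster, icc):
    """Size of the simple random sample with the same precision."""
    total = m_clusters * n_per_cluster
    return total / design_effect(n_per_cluster, icc)


# Hypothetical design: 30 clusters of 20 observations, rho = 0.05.
rho = 0.05
print("design effect:", design_effect(20, rho))
print("effective sample size:", effective_sample_size(30, 20, rho))

# The two extreme cases: identical values within a cluster (rho = 1)
# and no clustering at all (rho = 0).
print("rho = 1:", effective_sample_size(30, 20, 1.0))
print("rho = 0:", effective_sample_size(30, 20, 0.0))
```

With rho = 0.05 the 600 observations are worth roughly 308 independent ones. With rho = 1 the effective sample size is just the number of clusters, and with rho = 0 it is the full 600.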
A *theoretical* nested model can make some concepts clearer (Lohr).  
Suppose there are observations Y_ij = c + a_i + e_ij. There are m  
random effects a_i from a distribution with between-cluster SD s_b  
and, for each a_i, there are n e_ij's drawn from a distribution with  
"within-cluster" SD s_w. The a's and e's are independent. The total  
sample size is nm, and the variance of the sample mean is:
 var = [(s_b)^2]/m + [(s_w)^2]/(nm). You can see that, holding m
fixed, increasing the number of observations per cluster decreases  
only the 2nd term.
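
A few lines of Python make the point numerically. This is only a sketch of the formula above; the values of s_b, s_w and m are invented for illustration.

```python
def var_of_mean(s_b, s_w, m, n):
    """Return (between term, within term, total variance of the mean)
    for m clusters with n observations each, under the nested model."""
    between = s_b ** 2 / m
    within = s_w ** 2 / (n * m)
    return between, within, between + within


# Hold the number of clusters fixed at m = 20 and increase n.
for n in (5, 10, 50, 500):
    between, within, total = var_of_mean(s_b=2.0, s_w=6.0, m=20, n=n)
    print(f"n={n:4d}  between={between:.4f}  within={within:.4f}  "
          f"total={total:.4f}")
```

However large n becomes, the total never drops below the between-cluster term (s_b^2)/m; only adding clusters reduces that.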
The actual formulas for sampling from finite populations are more  
complicated, but the same principles apply.
-Steve
On Jul 8, 2008, at 5:07 AM, Ángel Rodríguez Laso wrote:
Following the discussion, I don't understand very well how degrees of
freedom (number of clusters-number of strata) and the actual number of
observations are used in svy commands (which are related to cluster
regression). I say so because when I calculate the sample size needed
in a survey to get a proportion with a determined confidence level,
the number I get is the number of observations and not the number of
degrees of freedom. So I assume that the number of observations is
what conditions the standard error and then I don't know what degrees
of freedom are used for.
Cheers,
Ángel Rodríguez