
From: Stas Kolenikov <skolenik@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: Survey design degrees of freedom help
Date: Fri, 4 Sep 2009 10:34:43 -0500

Jennifer,

You are good to go with the interpretation of your regression coefficients as they relate to the observation level. The point estimates are obtained in pretty much the same way as with plain vanilla data, except that weights are attached to the observations (and you can attach weights without the complex sample structure, too). For instance, in the regression case, you would have b = (X' w X)^{-1} (X' w Y), where w is the diagonal matrix of weights. You still use all the observations for point estimation; no compromises are made there. PSU information only contributes to the variance estimation. Getting the point estimates and getting the variance estimates can (and should) be thought of as rather unrelated issues. One of the exercises I give students in my survey statistics classes is to give examples of designs that would produce the same point estimates but different standard errors. I expect them to say something like "SRS vs. stratified sample" or "SRS vs. cluster sample".

Now, suppose you could collect the data for everybody in the population, and estimate the corresponding model -- let's call this a census regression. The standard errors that Stata gives you measure uncertainty about the census regression parameters, i.e., by how much your point estimates based on your sample might differ from the census parameters. If you sample everybody, your uncertainty is exactly zero. Of course, that's not the case in practice, when you sample but a tiny fraction of the total population. But still, you might sample enough units in a cluster to be pretty sure what the contribution of that cluster is -- that's your independent piece of information for variance estimation (i.e., measuring uncertainty). Even if you took a different sample from that same cluster, you would get roughly the same number.
That's why you want to treat the cluster as the independent piece of information. If all your clusters give about the same picture, you will get tight standard errors; if the clusters are all over the place, the standard errors will be large.

You can get a very useful measure of how much impact your sampling design has had on your estimation by typing -estat effects- after your -svy- command. It prints DEFF, the design effect, and MEFF, the misspecification effect. The first one shows how much the variances of the estimates change because of your complex sampling plan: this is the ratio of the actual variance to the variance obtained assuming independent data. The second one measures by how much you would be mistaken if you applied a naive variance formula. DEFF is somewhat better understood, in general. If you have numbers smaller than 1, you have efficiency gains because of your clever sample design. If you have numbers larger than, say, 3 or 5, your sampling design did not allow you to get a lot of information. Usually that's a consequence of tight clustering or wildly different weights (or both). Sometimes DEFF is interpreted in terms of the effective sample size: you need [your actual sample size]/DEFF observations in a simple random sample to get the same accuracy of the results. (Keep in mind that SRS are a hell of a lot of trouble to set up -- you need the complete list of the population, which you never have unless you are The Big Brother aka Census Bureau :)).

The largest DEFF I've seen in my practice was about 100. This was a village level characteristic, access to tap water. Once you have a piped well in the village, every respondent is a 1 on that variable; if you don't have a pipe, everybody is 0. So instead of the total sample size of ~10,000 individuals, I only had ~100 villages that contributed to the estimation of the % with access to tap water.
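
As a rough illustration of the DEFF arithmetic in the tap water story, here is a small Python sketch. It is not Stata code and not how -estat effects- computes its numbers: it assumes equal weights, equal-sized clusters, no strata and no finite population correction, and every name in it (tap_water, age, design_effect, and so on) is made up for the example. It simulates 100 villages of 100 respondents each and compares the cluster-based variance of a mean with the naive independent-data variance.

```python
import random
import statistics


def srs_variance(values):
    """Naive variance of the mean: treats every observation as independent."""
    return statistics.variance(values) / len(values)


def cluster_variance(clusters):
    """Variance of the mean with clusters (PSUs) as the independent units.

    Simplifying assumptions: equal weights, equal cluster sizes, no strata,
    clusters sampled with replacement (no finite population correction).
    """
    cluster_means = [statistics.mean(c) for c in clusters]
    return statistics.variance(cluster_means) / len(clusters)


def design_effect(clusters):
    """DEFF = cluster-based variance / naive independent-data variance."""
    pooled = [y for c in clusters for y in c]
    return cluster_variance(clusters) / srs_variance(pooled)


random.seed(2009)  # arbitrary seed, only for reproducibility
n_villages, n_per_village = 100, 100  # ~10,000 respondents in total

# Village-level variable: everybody in a village gives the same 0/1 answer.
tap_water = [[int(random.random() < 0.4)] * n_per_village
             for _ in range(n_villages)]

# Individual-level variable: independent draws, no village effect at all.
age = [[random.gauss(35, 12) for _ in range(n_per_village)]
       for _ in range(n_villages)]

deff_tap = design_effect(tap_water)
deff_age = design_effect(age)
n_total = n_villages * n_per_village

print(f"tap water: DEFF = {deff_tap:.1f}, "
      f"effective sample size = {n_total / deff_tap:.0f}")
print(f"age:       DEFF = {deff_age:.2f}, "
      f"effective sample size = {n_total / deff_age:.0f}")
```

Under these assumptions the tap water DEFF comes out at about 101, roughly the cluster size, so the 10,000 respondents are worth about 99 independent observations, which is the "only ~100 villages" arithmetic. The simulated age variable has no village effect at all, so its DEFF hovers around 1; real individual-level variables usually carry some clustering and would come out somewhat higher.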
In terms of my above explanation, everybody in the cluster gives exactly the same answer, and I am 100% sure about everybody in the cluster having or not having access to tap water (no cluster level uncertainty). The independent piece of information for this variable is given at the cluster level, rather than at the individual level. On the other hand, the variables that reflect information, activity or decision making at the household or individual level, such as age or contraception use, would have DEFFs of the order of 1.5 or so in the same data set.

Now, for your particular variables of interest, I imagine the knowledge of a park might be a cluster level variable (either there is a good park nearby, or there is none), while income is certainly a household level variable (although there would still be a tendency for income to be spatially correlated: there are rich neighborhoods, and there are poor neighborhoods). If you run -svy : mean- and -estat effects- on your park knowledge and income variables, I would imagine the first one would have a higher DEFF.

It's a shame you cannot find anybody to help you with statistics at UMN. Your statistics program is supposed to be one of the top 10 or so in the nation.

On Fri, Sep 4, 2009 at 8:09 AM, Jennifer Schmitt <jorg0206@umn.edu> wrote:
> Stas,
> Thank you for your answer, I've read the paper you suggested, but
> unfortunately my statistical background is very limited (probably part of
> why I've having a difficult time with this as it is), so I'm not sure I
> followed the entire thing. If I may, I have a follow-up question. If STATA
> only uses my villages (PSU) as my independent pieces of information and uses
> all my household data as data to help estimate my PSU and its variance does
> that mean STATA is not testing for differences in households, but
> differences among villages?
> In other words, when I run my logistic
> regression and find income to be positively associated with knowledge of a
> park does that really mean that villages with higher incomes have a greater
> odds of knowledge or that households with higher incomes have greater odds
> of knowledge? I want to be able to speak about households, but your
> explanation about what STATA is doing made me worry that STATA is only
> telling me about villages. Again, thank you for your help, I honestly have
> not found anyone else who can explain the "whys" behind STATA.

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

