
RE: st: RE: Survey design degrees of freedom help

From   <[email protected]>
To   <[email protected]>
Subject   RE: st: RE: Survey design degrees of freedom help
Date   Fri, 4 Sep 2009 12:01:14 -0400

It might also be worth simply running a 'naive regression' (just logit with pweights) to see the difference. (I'm not suggesting this is a valid empirical approach, of course.)

On a (somewhat) related topic, I have been working with the 'subpop' option of the -svy- commands for my logit models, and though I understand the theoretical basis for specifying a subpopulation instead of simply specifying 'if var1 == 1', in my case I found it made next to no difference in my standard errors.

It is sometimes interesting to run the simple, technically incorrect models in order to see what effect the more specialized specifications truly have on your particular dataset.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stas Kolenikov
Sent: September 4, 2009 11:35 AM
To: [email protected]
Subject: Re: st: RE: Survey design degrees of freedom help


You are good to go with the interpretation of your regression coefficients as they relate to the observation level. The point estimates are obtained in pretty much the same way as with plain vanilla data (except that weights are attached to the observations; but you can attach weights without the complex sample structure, too). For instance, in the regression case, you would have b = (X' w X)^{-1} (X' w Y), where w is the diagonal matrix of weights. You still use all the observations for point estimation; no compromises are made there. PSU information only contributes to the variance estimation. One of the exercises I give students in my survey statistics classes is to give examples of designs that would produce the same point estimates but different standard errors. I expect them to say something like "SRS vs. stratified sample" or "SRS vs. cluster sample". So getting the point estimates and getting the variance estimates can (and should) be thought of as rather unrelated issues.
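
As a minimal sketch of the point-estimation step only, here is the two-parameter case of b = (X' w X)^{-1} (X' w Y) written out in plain Python. The function name weighted_ols and the data are made up for illustration; this is not what Stata runs internally. The thing to notice is that the function takes only x, y and the weights: no PSU or stratum identifier enters the calculation.

```python
# Sketch: weighted least squares point estimates for y = a + b*x.
# This is the closed form of b = (X'wX)^{-1} (X'wY) with an intercept
# and one regressor. No cluster (PSU) information is used anywhere.

def weighted_ols(x, y, w):
    """Return (intercept, slope) minimizing sum(w_i * (y_i - a - b*x_i)**2)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Hypothetical data: six observations with unequal sampling weights.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
w = [10.0, 10.0, 25.0, 25.0, 40.0, 40.0]

intercept, slope = weighted_ols(x, y, w)
print(intercept, slope)
```

However the six observations are grouped into PSUs, the two numbers printed stay the same; only the standard errors would change.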

Now, suppose you could collect the data for everybody in the population, and estimate the corresponding model -- let's call this a census regression. The standard errors that Stata gives you measure uncertainty about the census regression parameters, i.e., by how much your point estimates based on your sample might differ from the census parameters. If you sample everybody, your uncertainty is exactly zero. Of course that's not the case in practice when you sample but a tiny fraction of the total population. But still you might sample enough units in a cluster to be pretty sure what the contribution of that cluster is -- that's your independent piece of information for variance estimation (i.e., measuring uncertainty). Even if you took a different sample from that same cluster, you would get roughly the same number. That's why you want to treat the cluster as the independent piece of information. If all your clusters give about the same picture, you will get tight standard errors; if the clusters are all over, the standard errors will be large.
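
To make the "clusters as independent pieces of information" idea concrete, here is a rough Python sketch. It assumes equal-sized clusters, equal weights and no finite population correction, which is far simpler than what -svy- does; the function name cluster_se and the data are invented for illustration. Each cluster is reduced to one number (its mean), and the standard error comes from how much those numbers disagree.

```python
from math import sqrt
from statistics import mean, stdev

def cluster_se(clusters):
    """Standard error of the overall mean, treating each cluster mean as one
    independent draw. Simplifying assumptions: equal-sized clusters, equal
    weights, no finite population correction."""
    cluster_means = [mean(c) for c in clusters]
    return stdev(cluster_means) / sqrt(len(cluster_means))

# Hypothetical data: four clusters of three observations; overall mean is 5 in both.
similar = [[4, 5, 6], [5, 5, 5], [4, 6, 5], [6, 4, 5]]    # every cluster mean is 5
scattered = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [5, 5, 5]]  # cluster means 2, 5, 8, 5

print(cluster_se(similar))    # clusters tell the same story: tight standard error
print(cluster_se(scattered))  # clusters disagree: larger standard error
```

Both data sets have the same point estimate, but the second has a much larger standard error because the clusters give different pictures.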

You can get a very useful measure of how much impact your sampling design has had on your estimation by typing -estat effects- after your -svy- command. It reports DEFF, the design effect, and, with the -meff- option, MEFF, the misspecification effect. The first one shows how much the variances of the estimates change because of your complex sampling plan: this is the ratio of the actual variance to the variance obtained assuming independent data. The second one measures by how much you would be mistaken if you applied a naive variance formula. DEFF is somewhat better understood, in general. If you have numbers smaller than 1, you have efficiency gains because of your clever sample design. If you have numbers larger than, say, 3 or 5, your sampling design did not allow you to get a lot of information. Usually that's a consequence of tight clustering or wildly different weights (or both). Sometimes DEFF is interpreted in terms of the effective sample size: you need [your actual sample size]/DEFF observations in a simple random sample to get the same accuracy of the results. (Keep in mind that SRS are a hell of a lot of trouble to set up -- you need the complete list of the population, which you never have unless you are The Big Brother aka Census Bureau :)).

The largest DEFF I've seen in my practice was about 100. This was for a village-level characteristic, access to tap water. Once you have a piped well in the village, every respondent is a 1 on that variable; if you don't have a pipe, everybody is 0. So instead of the total sample size of ~10,000 individuals, I only had ~100 villages that contributed to the estimation of the % with access to tap water. In terms of my above explanation, everybody in the cluster gives exactly the same answer, and I am 100% sure about everybody in the cluster having or not having access to tap water (no cluster-level uncertainty). The independent piece of information for this variable is given at the cluster level, rather than at the individual level. On the other hand, the variables that would reflect information, activity or decision making at the household or individual level, such as age or contraception use, would have DEFFs of the order of 1.5 or so in the same data set.
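
The tap-water arithmetic can be sketched in a few lines of Python. Everything here is made up for illustration: 100 villages of exactly 100 people, half of the villages with a pipe, equal weights, no finite population correction. The function name deff_for_mean is hypothetical, and this is a simplified ratio, not the formula -estat effects- uses.

```python
from statistics import mean, variance

def deff_for_mean(clusters):
    """Design-effect sketch: variance of the mean computed from cluster means,
    divided by the naive variance that treats every observation as independent.
    Simplifying assumptions: equal-sized clusters, equal weights, no FPC."""
    pooled = [value for c in clusters for value in c]
    var_cluster = variance([mean(c) for c in clusters]) / len(clusters)
    var_naive = variance(pooled) / len(pooled)
    return var_cluster / var_naive

# Made-up tap-water data: within a village everybody is 1 or everybody is 0.
tap_water = [[1] * 100 for _ in range(50)] + [[0] * 100 for _ in range(50)]

deff = deff_for_mean(tap_water)
n = sum(len(c) for c in tap_water)
effective_n = n / deff   # [actual sample size] / DEFF

print(deff, effective_n)
```

With these invented numbers the design effect comes out at roughly 101 and the effective sample size at roughly 99, i.e. about one observation per village, which is the pattern described above.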

Now, for your particular variable of interest, I imagine the knowledge of a park might be the cluster level variable (either there is a good park nearby, or there are none), while income is certainly a household level variable (although there would still be a tendency for income to be spatially correlated: there are rich neighborhoods, and there are poor neighborhoods). If you run -svy : mean- and -estat effects- on your park knowledge and income variables, I would imagine the first one would have a higher DEFF.

It's a shame you cannot find anybody to help you with statistics at UMN. Your statistics program is supposed to be one of the top 10 or so in the nation.

On Fri, Sep 4, 2009 at 8:09 AM, Jennifer Schmitt<[email protected]> wrote:
> Stas,
> Thank you for your answer, I've read the paper you suggested, but 
> unfortunately my statistical background is very limited (probably part 
> of why I'm having a difficult time with this as it is), so I'm not 
> sure I followed the entire thing.  If I may, I have a follow-up 
> question.  If STATA only uses my villages (PSU) as my independent 
> pieces of information and uses all my household data as data to help 
> estimate my PSU and its variance does that mean STATA is not testing 
> for differences in households, but differences among villages?  In 
> other words, when I run my logistic regression and find income to be 
> positively associated with knowledge of a park does that really mean 
> that villages with higher incomes have a greater odds of knowledge or 
> that households with higher incomes have greater odds of knowledge?  I 
> want to be able to speak about households, but your explanation about 
> what STATA is doing made me worry that STATA is only telling me about 
> villages.  Again, thank you for your help, I honestly have not found 
> anyone else who can explain the "whys" behind STATA.

Stas Kolenikov, also found at
Small print: I use this email account for mailing lists only.

*   For searches and help try:
