Statalist



Re: st: RE: Survey design degrees of freedom help


From   Stas Kolenikov <[email protected]>
To   [email protected]
Subject   Re: st: RE: Survey design degrees of freedom help
Date   Fri, 4 Sep 2009 10:34:43 -0500

Jennifer,

You are good to go with the interpretation of your regression
coefficients as they relate to the observation level. The way the
point estimates are obtained is pretty much the same as with plain
vanilla data (except that weights are attached to the
observations; but you can attach weights without the complex sample
structure, too). For instance, in the regression case, you would have b =
(X' w X)^{-1} (X' w Y) where w is the diagonal matrix of weights. You
still use all the observations for point estimation, no compromises
are made there. PSU information only contributes to the variance
estimation. Getting the point estimates and getting the variance
estimates can (and should) be thought of as rather unrelated issues. One
of the exercises I give students in my survey statistics classes is to
give examples of designs that would produce the same point estimates
but different standard errors. I expect them to say something like
"SRS vs. stratified sample" or "SRS vs. cluster sample".
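
To make the point-estimate formula concrete, here is a minimal Python
sketch (not what Stata runs internally) of b = (X' w X)^{-1} (X' w Y)
for the special case of an intercept plus one covariate. The function
name and the toy numbers are made up for illustration. Note that PSU
or strata identifiers never enter the calculation; only the data and
the weights do.

```python
def weighted_simple_regression(x, y, w):
    """Weighted least squares for y = b0 + b1 * x.

    This is the closed form of (X' w X)^{-1} (X' w Y) when X has a
    column of ones and a single covariate, and w is a diagonal matrix
    of observation weights.
    """
    sw = sum(w)
    # Weighted means of x and y.
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted cross-product and weighted sum of squares.
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


# Hypothetical toy data: four observations with unequal weights.
x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
w = [1, 2, 3, 4]
b0, b1 = weighted_simple_regression(x, y, w)
```

Reshuffling these four observations among clusters would leave b0 and
b1 untouched; only the standard errors would change.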

Now, suppose you could collect the data for everybody in the
population, and estimate the corresponding model -- let's call this a
census regression. The standard errors that Stata gives you measure
uncertainty about the census regression parameters, i.e., by how much
your point estimates based on your sample might differ from the census
parameters. If you sample everybody, your uncertainty is exactly zero.
Of course that's not the case in practice when you sample but a tiny
fraction of the total population. But still you might sample enough
units in a cluster to be pretty sure what the contribution of that
cluster is -- that's your independent piece of information for
variance estimation (i.e., measuring uncertainty). Even if you took a
different sample from that same cluster, you would get roughly the
same number. That's why you want to treat the cluster as the
independent piece of information. If all your clusters give about the
same picture, you will get tight standard errors; if the clusters are
all over, the standard errors will be large.
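
As a rough illustration of the cluster as the independent piece of
information, the Python sketch below computes the standard error of a
mean from the spread of the cluster means. It assumes a deliberately
simple design (one stratum, equal-size clusters, equal weights, no
finite population correction); the function name and the data are
hypothetical, and real survey software handles far more general
designs.

```python
from math import sqrt
from statistics import mean, variance


def cluster_se_of_mean(clusters):
    """Standard error of the overall mean, treating each cluster as
    one independent observation.

    Assumes equal-size clusters, equal weights, a single stratum and
    no finite population correction.
    """
    cluster_means = [mean(c) for c in clusters]
    n_clusters = len(cluster_means)
    # Between-cluster variance of the cluster means, divided by the
    # number of clusters (not by the number of individuals).
    return sqrt(variance(cluster_means) / n_clusters)


# Clusters that all tell the same story: every cluster mean is 2.
tight = [[1, 2, 3], [1, 2, 3], [2, 2, 2]]
# Clusters that are all over the place: cluster means are 0 and 4.
spread = [[0, 0, 0], [4, 4, 4]]

se_tight = cluster_se_of_mean(tight)    # 0.0
se_spread = cluster_se_of_mean(spread)  # 2.0
```

The individual values inside a cluster only matter through the cluster
mean here, which is the sense in which the cluster is the unit that
carries the information for variance estimation.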

You can get a very useful measure of how much impact your sampling
design has had on your estimation by typing -estat effects- after your
-svy- command. It prints DEFF, the design effect, and (if you ask for
it with the -meff- option) MEFF, the misspecification effect. The first
one shows how much the variances of
the estimates change because of your complex sampling plan: this is
the ratio of the actual variance to the variance obtained assuming
independent data. The second one is measuring by how much you would be
mistaken if you applied a naive variance formula. DEFF is somewhat
better understood, in general. If you have numbers smaller than 1, you
have efficiency gains because of your clever sample design. If you
have numbers more than say 3 or 5, your sampling design did not allow
you to get a lot of information. Usually that's a consequence of tight
clustering or wildly different weights (or both). Sometimes DEFF is
interpreted in terms of the effective sample size: you need [your
actual sample size]/DEFF observations in a simple random sample to get
the same accuracy of the results. (Keep in mind that SRS are a hell of
a lot of
trouble to set up -- you need the complete list of the population
which you never have unless you are The Big Brother aka Census Bureau
:)). The largest DEFF I've seen in my practice was about 100. This
was a village level characteristic, access to tap water. Once you have
a piped well in the village, every respondent is a 1 on that variable;
if you don't have a pipe, everybody is 0. So instead of the total
sample size of ~10,000 individuals, I only had ~100 villages that
contributed to the estimation of the % with access to tap water. In
terms of my above explanation, everybody in the cluster gives exactly
the same answer, and I am 100% sure about everybody in the cluster
having or not having access to tap water (no cluster level
uncertainty). The independent piece of information for this variable
is given at the cluster level, rather than an individual level. On the
other hand, the variables that would reflect information, activity or
decision making at the household or individual level, such as age or
contraception use, would have DEFFs of the order of 1.5 or so in the
same data set.
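
The tap water arithmetic can be reproduced with a small Python sketch.
It uses made-up data (100 villages of 100 people each, with 40
villages having tap water; only the rough totals of ~100 villages and
~10,000 individuals come from the example above) and the same
simplifying assumptions as a textbook one-stage cluster sample: equal
cluster sizes, equal weights, one stratum, no finite population
correction. The function name is hypothetical, and this is not how
-estat effects- computes its output.

```python
from statistics import mean, variance


def design_effect(clusters):
    """Ratio of the cluster-based variance of the mean to the naive
    variance that treats all observations as independent.

    Assumes equal-size clusters, equal weights, a single stratum and
    no finite population correction.
    """
    cluster_means = [mean(c) for c in clusters]
    pooled = [y for c in clusters for y in c]
    var_cluster = variance(cluster_means) / len(cluster_means)
    var_naive = variance(pooled) / len(pooled)
    return var_cluster / var_naive


# Hypothetical villages: everybody in a village gives the same answer.
villages = [[1] * 100 if v < 40 else [0] * 100 for v in range(100)]
n_total = sum(len(v) for v in villages)  # 10,000 individuals

deff = design_effect(villages)  # about 101
n_eff = n_total / deff          # about 99, i.e. roughly one per village
```

With no variation inside the villages, the effective sample size
collapses to roughly the number of villages, which is the DEFF of
about 100 described above.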

Now, for your particular variable of interest, I imagine the knowledge
of a park might be the cluster level variable (either there is a good
park nearby, or there is none), while income is certainly a household
level variable (although there would still be a tendency for income to
be spatially correlated: there are rich neighborhoods, and there are
poor neighborhoods). If you run -svy : mean- and -estat effects- on
your park knowledge and income variables, I would imagine the first
one would have a higher DEFF.

It's a shame you cannot find anybody to help you with statistics at
UMN. Your statistics program is supposed to be one of the top 10 or so
in the nation.

On Fri, Sep 4, 2009 at 8:09 AM, Jennifer Schmitt<[email protected]> wrote:
> Stas,
> Thank you for your answer, I've read the paper you suggested, but
> unfortunately my statistical background is very limited (probably part of
> why I've having a difficult time with this as it is), so I'm not sure I
> followed the entire thing.  If I may, I have a follow-up question.  If STATA
> only uses my villages (PSU) as my independent pieces of information and uses
> all my household data as data to help estimate my PSU and its variance does
> that mean STATA is not testing for differences in households, but
> differences among villages?  In other words, when I run my logistic
> regression and find income to be positively associated with knowledge of a
> park does that really mean that villages with higher incomes have a greater
> odds of knowledge or that households with higher incomes have greater odds
> of knowledge?  I want to be able to speak about households, but your
> explanation about what STATA is doing made me worry that STATA is only
> telling me about villages.  Again, thank you for your help, I honestly have
> not found anyone else who can explain the "whys" behind STATA.

-- 
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.



