Statalist



Re: st: RE: Survey design degrees of freedom help


From   Jennifer Schmitt <jorg0206@umn.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: RE: Survey design degrees of freedom help
Date   Fri, 04 Sep 2009 08:09:37 -0500

Stas,
Thank you for your answer. I've read the paper you suggested, but unfortunately my statistical background is very limited (probably part of why I'm having a difficult time with this as it is), so I'm not sure I followed the entire thing. If I may, I have a follow-up question. If Stata uses only my villages (PSUs) as independent pieces of information, and uses all my household data to help estimate each PSU's value and its variance, does that mean Stata is testing for differences among villages rather than among households? In other words, when I run my logistic regression and find income to be positively associated with knowledge of a park, does that really mean that villages with higher incomes have greater odds of knowledge, or that households with higher incomes have greater odds of knowledge? I want to be able to speak about households, but your explanation of what Stata is doing made me worry that Stata is only telling me about villages. Again, thank you for your help; I honestly have not found anyone else who can explain the "whys" behind Stata.
Cheers,
Jennifer

Stas Kolenikov wrote:
Jennifer,

first of all, you don't need to subtract the constant term.
Stratification, in a sense, implies estimation of the fixed effect of
a stratum (although there's way more going on).

You are right in thinking about degrees of freedom as independent pieces
of information provided by each cluster/village. In the extreme case
when you sample the complete cluster, you only have one number out of
it (in terms of contributions to the variability of your estimates),
no matter how many units you have in the cluster. In less extreme
cases with large sample sizes within cluster, you have that number
(the cluster mean, say) plus some relatively small amount of variation
around it, so you still have 1 d.f. contributed by the cluster (or
1+epsilon, if you like; although nobody really knows what this epsilon
might be). If you think about your sample (x1, ..., xn) as a vector in
n-dimensional space, the standard i.i.d. theory assumes that each
component of the vector can vary on its own, thus producing n degrees
of freedom for the sample, and n-1 degrees of freedom for variance
estimation (minus the overall mean). However, in the complex survey
sampling case, the components corresponding to the same cluster move
together, at least to some extent, so your effective dimension is
much lower than n, and in the aforementioned extreme case it is
#PSUs - #strata.

The issue of degrees of freedom has been discussed by Korn & Graubard,
although I am not sure whether it was their book
(http://www.citeulike.org/user/ctacmo/article/553280) or a paper
(http://www.citeulike.org/user/ctacmo/article/933864). If you are
really short on degrees of freedom, you can cheat and go to the next
level, and use SSUs instead of PSUs as the baseline for degrees of
freedom (so d.f. = #SSUs - #strata). That's what you've done, too,
with your 90 SSUs and 86 "cheated" d.f.s. They've outlined some other
approaches, but that's probably the one easiest to understand. Still I
would frown upon that, and if I were to referee a paper that does
this, I would have the authors write a half-page explanation of what
they are doing, and recognize that this is basically a wrong thing to
do.
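
As a small illustration of the two counts discussed above, here is a minimal Python sketch of the arithmetic. The function name design_df is hypothetical (it is not a Stata command); the counts of 20 villages, 90 subvillages and 3 strata are the ones quoted in this thread. Following the advice above, the constant term is not subtracted; subtracting one more, as in the quoted message, gives the 16 and 86 mentioned there.

```python
def design_df(n_units: int, n_strata: int) -> int:
    """Design degrees of freedom: sampled units minus strata.

    Hypothetical helper for illustration only; not a Stata function.
    """
    if n_units <= n_strata:
        raise ValueError("need more sampled units than strata")
    return n_units - n_strata


# Counts quoted in this thread: 20 villages (PSUs),
# 90 subvillages (SSUs), 3 strata.
village_df = design_df(20, 3)      # PSU-based d.f.
subvillage_df = design_df(90, 3)   # SSU-based, "cheated" d.f.

print(village_df, subvillage_df)   # prints: 17 87
```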

Now, where would those degrees of freedom matter in estimation
procedures? First, that's the number of terms added up to form the
covariance matrix, so the rank of that matrix is bounded by d.f.s. You
might still be able to run a regression with more terms, but Stata
will refuse to conduct tests with more than d.f. terms. That is the
main concern you are voicing. Second, the d.f.s are also used in the
Student distribution for testing purposes. Nobody has ever justified
the use of the Student distribution in this context (in the end, it is a
model-based derivation assuming normality, whereas the survey
inference is supposed to be fully non-parametric without any
distributional assumptions), but it seems to be working better as an
approximation to the realistic distributions.
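
The rank bound mentioned above can be checked numerically. The following is a minimal Python sketch, not Stata's actual implementation: it assumes the usual stratified between-PSU form of the variance estimator (a sum over strata of outer products of PSU-level deviations from the stratum mean), and the PSU-level score vectors are made-up numbers. With 2 strata of 3 PSUs each, the deviations within each stratum sum to zero, so a 6 x 6 variance matrix can have rank at most 6 - 2 = 4.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over rational numbers."""
    m = [[Fraction(x) for x in row] for row in rows]
    ncols = len(m[0]) if m else 0
    rank = 0
    for col in range(ncols):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def cluster_variance_matrix(strata):
    """Sum over strata of n_h/(n_h - 1) * sum_i d_hi d_hi',
    where d_hi is PSU i's deviation from its stratum mean."""
    k = len(strata[0][0])
    v = [[Fraction(0)] * k for _ in range(k)]
    for psus in strata:
        n_h = len(psus)
        mean = [Fraction(sum(p[j] for p in psus), n_h) for j in range(k)]
        scale = Fraction(n_h, n_h - 1)
        for p in psus:
            d = [p[j] - mean[j] for j in range(k)]
            for a in range(k):
                for b in range(k):
                    v[a][b] += scale * d[a] * d[b]
    return v


# Made-up PSU-level score vectors for 6 coefficients (hypothetical data).
strata = [
    [[2, 1, 0, 3, -1, 4], [0, 2, 1, -2, 3, 1], [1, -3, 2, 0, 2, -1]],
    [[3, 0, -1, 1, 2, 2], [-2, 1, 4, 0, -1, 3], [1, 1, 1, -3, 0, -2]],
]

n_psus = sum(len(psus) for psus in strata)
df_bound = n_psus - len(strata)          # #PSUs - #strata
v = cluster_variance_matrix(strata)

print("coefficients:", len(v))
print("d.f. bound:", df_bound)
print("rank of variance matrix:", matrix_rank(v))
```

Because the matrix is singular whenever there are more coefficients than design d.f., a joint test of all six coefficients cannot be computed from it, which is consistent with the refusal described above.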

Amazingly (and shamefully), I cannot produce any references off the
top of my head that would deliver a clear explanation of those degrees
of freedom (I am not in my office where all the books are now). I hope
Korn & Graubard would give some references when they discuss the
issue...

I've seen things going either way with those degrees of freedom in my
analytical work and simulations. Sometimes, when your cluster effects
are not terribly strong, you are OK with #SSU-#strata (and if
#PSU-#strata is over a hundred, who cares, anyway). Other times, I've
seen the effective degrees of freedom around 5 or 10 when the nominal
degrees of freedom (#PSU-#strata) was close to a hundred -- I had some
problematic strata with extreme skewness and kurtosis, so whatever I
happened to sample there was driving the remainder of the sample.

On Thu, Sep 3, 2009 at 2:20 PM, Jennifer Schmitt<jorg0206@umn.edu> wrote:
Thank you for your thoughts.  The 20 villages are the only independent
pieces of information, the rest are related.  Is that the reason?  It just
seems so restrictive.  I do have multi-stage sampling and my understanding
of Stata is that it uses an "ultimate" cluster method, so unless my fpc are
defined (which I don't define because they are all close to one), Stata
doesn't care about subsequent clusters because it incorporates all later
stages of clustering in the main cluster.  Therefore there is no change in
my df.  I have gone ahead and, when necessary (because I need more df), I have
defined my PSU as subvillage and get 90 (#subvillages) - 3 (#strata) - 1 (for
the constant) = 86 df, but then I am ignoring the correlation of
subvillages within a village.  I feel confident that I really only have 16
df; it is just a matter of convincing others who do not know Stata or survey
statistics that I have set up the statistical restrictions correctly, and
given the low df I have yet to convince them.  I've told them that the
villages are the only independent units, but that just does not seem
sufficient.  Any more thoughts from you or others are greatly appreciated,
but regardless, thanks for your thoughts thus far.



--
Jennifer Schmitt
PhD Candidate - Conservation Biology Program
University of Minnesota
100 Ecology Building
1987 Upper Buford Circle
St. Paul, MN 55108
jorg0206@umn.edu
