
Re: st: RE: Survey design degrees of freedom help

From   Stas Kolenikov <[email protected]>
To   [email protected]
Subject   Re: st: RE: Survey design degrees of freedom help
Date   Thu, 3 Sep 2009 14:47:29 -0500


First of all, you don't need to subtract the constant term.
Stratification, in a sense, implies estimation of the fixed effect of
a stratum (although there's way more going on).

You are right in thinking about degrees of freedom as independent pieces
of information provided by each cluster/village. In the extreme case
when you sample the complete cluster, you only have one number out of
it (in terms of contributions to the variability of your estimates),
no matter how many units you have in the cluster. In less extreme
cases with large sample sizes within cluster, you have that number
(the cluster mean, say) plus some relatively small amount of variation
around it, so you still have 1 d.f. contributed by the cluster (or
1+epsilon, if you like; although nobody really knows what this epsilon
might be). If you think about your sample (x1, ..., xn) as a vector in
n-dimensional space, the standard i.i.d. theory assumes that each
component of the vector can vary on its own, thus producing n degrees
of freedom for the sample, and n-1 degrees of freedom for variance
estimation (minus the overall mean). However, in the complex survey
sampling case, the components corresponding to the same cluster
go together, at least to some extent, so your effective dimension is
much lower than n, and in the aforementioned extreme cases it is #PSUs
- #strata.

The issue of degrees of freedom has been discussed by Korn & Graubard,
although I am not sure whether it was in their book or in a paper. If you are
really short on degrees of freedom, you can cheat and go to the next
level, and use SSUs instead of PSUs as the baseline for degrees of
freedom (so d.f. = #SSUs - #strata). That's what you've done, too,
with your 90 SSUs and 86 "cheated" d.f. Korn & Graubard outlined some
other approaches, but that's probably the one easiest to understand. Still I
would frown upon that, and if I were to referee a paper that does
this, I would have the authors write a half-page explanation of what
they are doing, and recognize that this is basically a wrong thing to do.
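
To make the two counts concrete, here is a minimal sketch in Python (not Stata) of the nominal d.f. under each choice of sampling unit. The counts of 20 villages, 90 subvillages and 3 strata are taken from the quoted message below; the function name is made up for illustration. Since the constant term is not subtracted, the results are 17 and 87 rather than 16 and 86.

```python
def design_df(n_units, n_strata):
    """Nominal design degrees of freedom: sampled units minus strata.

    Illustrative helper, not a Stata function.
    """
    if n_units <= n_strata:
        raise ValueError("need more sampled units than strata")
    return n_units - n_strata


N_STRATA = 3        # from the quoted message
N_VILLAGES = 20     # PSUs
N_SUBVILLAGES = 90  # SSUs

# Conventional count: villages (PSUs) are the independent pieces.
df_psu = design_df(N_VILLAGES, N_STRATA)

# "Cheated" count: subvillages (SSUs) treated as if they were independent.
df_ssu = design_df(N_SUBVILLAGES, N_STRATA)

print("d.f. based on PSUs:", df_psu)
print("d.f. based on SSUs:", df_ssu)
```

The SSU-based figure is only as trustworthy as the assumption that subvillages within a village are uncorrelated.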

Now, where would those degrees of freedom matter in estimation
procedures? First, that's the number of terms added up to form the
covariance matrix, so the rank of that matrix is bounded by d.f.s. You
might still be able to run a regression with more terms, but Stata
will refuse to conduct tests with more than d.f. terms. That is the
main concern you are voicing. Second, the d.f.s are also used in the
Student distribution for testing purposes. Nobody has ever justified
the use of Student distribution in this context (in the end, it is a
model-based derivation assuming normality, whereas the survey
inference is supposed to be fully non-parametric without any
distributional assumptions), but it seems to be working better as an
approximation to the realistic distributions.
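
The rank bound can be checked numerically. Below is a rough sketch in Python, assuming the usual between-PSU, within-stratum form of the variance estimator with finite population corrections ignored; the toy design (7 PSUs in 3 strata, 6 coefficients), the random "PSU totals" and all function names are made up for illustration and are not Stata's internals.

```python
import random


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def stratified_cov(totals_by_stratum, p):
    """Sum over strata of n_h/(n_h-1) * sum_i (z_hi - zbar_h)(z_hi - zbar_h)'."""
    cov = [[0.0] * p for _ in range(p)]
    for totals in totals_by_stratum:
        n_h = len(totals)
        mean = [sum(z[j] for z in totals) / n_h for j in range(p)]
        scale = n_h / (n_h - 1)
        for z in totals:
            d = [z[j] - mean[j] for j in range(p)]
            for a in range(p):
                for b in range(p):
                    cov[a][b] += scale * d[a] * d[b]
    return cov


random.seed(2009)
P = 6                         # number of coefficients (hypothetical)
PSUS_PER_STRATUM = [3, 2, 2]  # hypothetical design: 7 PSUs, 3 strata

data = [[[random.gauss(0, 1) for _ in range(P)] for _ in range(n_h)]
        for n_h in PSUS_PER_STRATUM]
cov = stratified_cov(data, P)

design_df = sum(PSUS_PER_STRATUM) - len(PSUS_PER_STRATUM)
rank = matrix_rank(cov)
print("coefficients:", P, " design d.f.:", design_df, " rank of cov:", rank)
```

Centering within each stratum removes one dimension per stratum, so the 6 x 6 matrix cannot have rank above #PSUs - #strata, and a joint test of all 6 coefficients is not available.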

Amazingly (and ashamingly), I cannot produce any references off the
top of my head that would deliver a clear explanation of those degrees
of freedom (I am not in my office where all the books are now). I hope
Korn & Graubard would give some references when they discuss the issue.

I've seen things going either way with those degrees of freedom in my
analytical work and simulations. Sometimes, when your cluster effects
are not terribly strong, you are OK with #SSU-#strata (and if
#PSU-#strata is over a hundred, who cares, anyway). Other times, I've
seen the effective degrees of freedom around 5 or 10 when the nominal
degrees of freedom (#PSU-#strata) was close to a hundred -- I had some
problematic strata with extreme skewness and kurtosis, so whatever I
happened to sample there was driving the remainder of the sample.
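
To see how much the gap between nominal and effective d.f. matters for inference, here is a small sketch in Python that computes two-sided 95% Student t critical values for 5, 10 and 100 d.f. It uses only the standard library (numerical integration of the t density plus bisection), so the function names are illustrative and the results are approximations, not output from Stata.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, n_intervals=2000):
    """CDF via composite Simpson integration of the density over [0, |x|]."""
    if x < 0:
        return 1.0 - t_cdf(-x, df, n_intervals)
    if x == 0:
        return 0.5
    h = x / n_intervals
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(prob, df, upper=50.0):
    """Quantile for prob in (0.5, 1), found by bisection on [0, upper]."""
    lo, hi = 0.0, upper
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Effective d.f. of 5 or 10 versus a nominal d.f. close to 100.
critical = {df: t_quantile(0.975, df) for df in (5, 10, 100)}
for df, value in critical.items():
    print(f"d.f. = {df:3d}: two-sided 95% critical value = {value:.3f}")
```

Standard t tables give roughly 2.57, 2.23 and 1.98 for these three cases, so confidence intervals based on the nominal d.f. would be noticeably too narrow if the effective d.f. is really 5 or 10.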

On Thu, Sep 3, 2009 at 2:20 PM, Jennifer Schmitt<[email protected]> wrote:
> Thank you for your thoughts.  The 20 villages are the only independent
> pieces of information, the rest are related.  Is that the reason?  It just
> seems so restrictive.  I do have multi-stage sampling and my understanding
> of STATA is that it uses an "ultimate" cluster method, so unless my fpc are
> defined (which I don't define because they are all close to one), then STATA
> doesn't care about subsequent clusters because STATA incorporates all later
> stages of clustering in the main cluster.  Therefore there is no change in
> my df.  I have gone ahead and when necessary (because I need more df) I have
> defined my PSU as subvillage and get 90 (#subvillages) - 3 (#strata) - 1 (for
> the constant) = 86 df, but then I am ignoring the correlation of
> subvillages within a village.  I feel confident that I really only have 16
> df, it is just convincing others who do not know STATA or survey statistics
> that I have set up the statistical restrictions correctly and given the low
> df I have yet to convince others.  I've told them that the villages are the
> only independent units, but that just does not seem sufficient.  Any more
> thoughts by you or others are greatly appreciated, but regardless thanks for
> your thoughts thus far.

Stas Kolenikov, also found at
Small print: I use this email account for mailing lists only.
