
Re: st: An allowance for clustering should only increase SE's?

From   Joanne Garrett <>
Subject   Re: st: An allowance for clustering should only increase SE's?
Date   Wed, 13 Apr 2005 12:29:17 -0400

Everyone is very concerned about correcting s.e.'s for intraclass correlation, but in reality the intraclass correlation is sometimes not present. Small positive estimates of rho may just be random variation around zero. Given that, small negative estimates of rho are also possible.

Negative values imply that people within a cluster (or PSU) are more different from each other than they would have been had you sampled randomly (i.e., not by cluster). Intuitively this makes no sense, but the formula still applies. If rho=0, the formula reduces to the original variance. If rho>0, then the variance, and therefore the s.e., will be larger. If rho<0, the variance will be smaller. This is probably what is happening in your case.

A conservative solution is to use the corrected s.e. when it is larger, but use the original s.e. when correcting makes it smaller. If others have a better solution, I would love to hear it.
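As a rough illustration of the formula quoted below, here is a minimal sketch that evaluates the factor 1+(N-1)*rho for a few values of rho and then asks Stata for the estimated design effect. The cluster size of 20 and the rho values are made-up numbers, the variable names are those from the original question, and the svyset/svy: prefix syntax assumes Stata 9 or later rather than the svymean command used in the question.

```stata
* Factor by which clustering changes the variance: 1 + (N-1)*rho
* N = 20 and the rho values are hypothetical, for illustration only
local N = 20
foreach rho in -0.02 0 0.05 {
    display "rho = `rho'   factor = " 1 + (`N' - 1)*(`rho')
}

* Estimated design effect for the actual data
* (assumes Stata 9+ syntax; variable names from the question)
svyset hospital [pweight=weight]
svy: mean outcome
estat effects
```

With N=20, rho=0.05 gives a factor of 1.95 and rho=-0.02 gives 0.62, so even a slightly negative rho pulls the variance below the simple-random-sample value. Note that the DEFF reported by estat effects also reflects the weights, not only the clustering, so it is only a rough guide to the sign of rho.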

CDSC - Nichols, Tom wrote:

Dear statalist,

I have survey data of patients clustered within a sample of hospitals.

If I use:
svyset [pweight=weight], psu(hospital) clear
svymean outcome

this sometimes gives me a SE less than if I had ignored the clustering in
the sample:
svyset [pweight=weight], clear
svymean outcome
I suppose this is because there is rather less variation between cluster
means than within clusters.
But shouldn't an allowance for clustering in the sample increase the SE, not
reduce it?

I understand the formula for the sampling variance from a cluster sample is
var(C) = var(R)*[1+(N-1)*rho]
where var(C) is the variance from the cluster sample of equal size clusters,
var(R) is the variance from a simple random sample of the same size, N is the size of a cluster, and
rho is the intracluster correlation which must be between 0 and 1.
So var(C) must be greater than var(R).

For some outcomes using the psu( ) option increases the SE and for others
(as mentioned above) it decreases the SE.
The average change for all the outcomes is probably a small increase.
Should I use the psu( ) option only if it increases SE's?
Or should I always use the psu( ) option and just accept that this will
sometimes decrease SE's?

Any advice would be much appreciated.


