
# Re: st: Direction of the effect of the cluster command on the standard error depends on the inclusion of a control variable

From: Austin Nichols <[email protected]>
To: [email protected]
Subject: Re: st: Direction of the effect of the cluster command on the standard error depends on the inclusion of a control variable
Date: Wed, 5 Jan 2011 20:53:34 -0500

```
Jacob Felson <[email protected]> :
You should have at least 20 clusters and your smallest cluster should
be at least 5% of the data (i.e. 20 balanced clusters, or more
unbalanced clusters; see e.g.
http://www.stata.com/meeting/13uk/nichols_crse.pdf) to feel
comfortable with the cluster-robust SE estimator.  But to answer your
original question, the residuals are quite different after you include
z as a regressor, so the intracluster correlation can also be quite
different.

On Wed, Jan 5, 2011 at 8:02 PM, Stas Kolenikov <[email protected]> wrote:
> There are terrible small sample biases exhibited by -robust- and
> -cluster()- standard errors with small # of observations and clusters,
> respectively. As was noted by Justina, four clusters is SO far away
> from asymptotics that I wouldn't even consider the clustered standard
>
> On Wed, Jan 5, 2011 at 6:01 PM, Jacob Felson <[email protected]> wrote:
>> I wonder if anyone might be able to provide an explanation for the
>> following scenario.  I'm wondering why the direction of the change in
>> a standard error affected by the use of the cluster command depends on
>> whether another control variable is included.  My inquiry is more
>> theoretical than practical, as I'm not wondering "what I should do"
>> but rather, simply "why is this happening?"   Let me elaborate below.
>>
>> Consider the following variables:
>>
>> y, the dependent variable
>> x, the independent variable of greatest interest, which is moderately
>> correlated with y and with z
>> z, another independent variable, which is correlated with y at about 0.5.
>>
>> nation - the data was collected in 4 different nations by different
>> organizations.
>>
>>
>> I am examining the standard errors (SE) for the coefficient of
>> variable x from the following four models:
>>
>> 1. Regress y on x, without clustering on nation.
>> 2. Regress y on x, with clustering on nation.
>>
>> 3. Regress y on x and z without clustering on nation.
>> 4. Regress y on x and z with clustering on nation.
>>
>>
>> The SE of the coefficient for x is LARGER in model 2 than in model 1.
>> This suggests there is a positive intracluster correlation.  That is,
>> the residuals are more similar to each other within nations than we
>> would expect by chance alone.  I suppose there is a preponderance of
>> positive residuals in some nations and a preponderance of negative
>> residuals in other nations.
>>
>> The SE of the coefficient for x is SMALLER in model 4 than in model 3.
>> This suggests there is a negative intracluster correlation.  That is,
>> the residuals are less similar to each other within nations than we
>> would expect by chance.
>>
>>
>> So the effect that clustering on nation has on the SE of x depends on
>> whether a third variable, z, is controlled.  Why is this?

```
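
The point in the reply above, that the residuals (and so the within-cluster correlation) change once z enters the model, can be illustrated with a small sketch. This is not code from the thread: the eight data points, the helper names `partial_out` and `se_of_x`, and the choice of plain Python instead of Stata are all assumptions made for illustration. The data are constructed so that z is orthogonal to x, which the original question does not assume, and the cluster-robust formula omits the finite-sample correction that -regress, vce(cluster)- applies, so the numbers would not match Stata output.

```python
from collections import defaultdict
from math import sqrt


def partial_out(v, z=None):
    """Residuals from regressing v on a constant and, optionally, z."""
    mean_v = sum(v) / len(v)
    v = [a - mean_v for a in v]
    if z is None:
        return v
    mean_z = sum(z) / len(z)
    z = [b - mean_z for b in z]
    slope = sum(a * b for a, b in zip(v, z)) / sum(b * b for b in z)
    return [a - slope * b for a, b in zip(v, z)]


def se_of_x(y, x, cluster, z=None):
    """Return (classical SE, cluster-robust SE) for the coefficient on x.

    Uses the Frisch-Waugh result: partial the constant (and z) out of
    both y and x, then work with the one-regressor problem.
    No finite-sample correction is applied to the clustered SE.
    """
    n = len(y)
    k = 2 if z is None else 3            # estimated parameters
    xt = partial_out(x, z)
    yt = partial_out(y, z)
    sxx = sum(a * a for a in xt)
    b = sum(a * c for a, c in zip(xt, yt)) / sxx
    resid = [c - b * a for a, c in zip(xt, yt)]

    classical = sqrt(sum(e * e for e in resid) / (n - k) / sxx)

    # Sum the scores x_i * e_i within each cluster, then square the sums.
    score = defaultdict(float)
    for g, a, e in zip(cluster, xt, resid):
        score[g] += a * e
    clustered = sqrt(sum(s * s for s in score.values())) / sxx
    return classical, clustered


# Hypothetical data: 4 "nations" with 2 observations each.
nation = [1, 1, 2, 2, 3, 3, 4, 4]
x = [-1, 1, -1, 1, -1, 1, -1, 1]
z = [-1, 1, -1, 1, 1, -1, 1, -1]
y = [-4, 7, -6, 3, 2, 1, 0, -3]        # y = 2*x + 3*z + u

for label, ctrl in (("y on x", None), ("y on x and z", z)):
    c, r = se_of_x(y, x, nation, ctrl)
    print(f"{label}: classical {c:.3f}, clustered {r:.3f}")
# y on x: classical 1.384, clustered 1.521
# y on x and z: classical 0.707, clustered 0.250
```

In this constructed example the clustered SE is larger than the classical one when z is omitted and smaller once z is included. What drives the clustered SE is the within-nation sum of x times the residual, and that sum changes when z absorbs part of the residual. With only four clusters neither clustered figure should be trusted, as both replies note.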
