 Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

st: Strange results with cluster option

 From    "Bottan, Nicolas Luis"
 To      "statalist@hsphsun2.harvard.edu"
 Subject st: Strange results with cluster option
 Date    Mon, 27 Sep 2010 10:22:58 -0400

Hi everyone,

I’m getting strange results from the cluster option when running OLS: the standard error increases as cluster size increases, and cluster sizes in my data are very heterogeneous.

I am attaching a simple Monte Carlo simulation in Stata to check whether the cluster option is working fine.

I construct a simple example where an outcome Y is the sum of a school random variable and a student random variable. Both have mean 0 and standard deviation 1.

In each simulation I test the null hypothesis that the mean of Y is zero. Because the null hypothesis is true, it should be rejected only about 5% of the time. With the cluster option in Stata, it is rejected around 35% of the time. Alternatively, if I collapse the data to the school level and then regress Y on a constant (giving the same weight to all schools), the null is rejected 4% of the time.

Any thoughts?
Thanks!

Here is the code:
* THIS DO FILE GENERATES A MONTE CARLO SIMULATION TO CHECK WHETHER THE CLUSTER OPTION OF THE REG COMMAND IN STATA IS
* WORKING WELL. ALSO IT CHECKS TWO ALTERNATIVE OPTIONS TO ESTIMATE STANDARD ERRORS WHEN OBSERVATIONS ARE CLUSTERED

* TO THAT END, IT ASSUMES THAT:
* Yij=Vj+Uij
* where Y is some outcome variable defined at the student level, Vj is a school effect and Uij is a student effect

* V and U are independent and they are distributed normal with mean 0 and standard deviation 1.

* In the data, there are 100 schools. In 99 schools there is only one observation of a student. In one school there are
* observations of 101 students

* We test the null hypothesis that the mean of Y is zero. By construction this null is true. Then, we run 500 simulations
* and we record in how many cases we reject the null under three different estimation strategies. In the first one we
* use the cluster option in the regression command. In the second one we collapse the data at the school level (averaging Y)
* and then run a regression of Y on a constant weighting observations by the number of students in the school. In the third
* one we do the same procedure as in the second one but we give the same weight to all 100 schools

* With 500 simulations, each estimation strategy, if it is working well, should reject the
* null approximately 25 times at the 5% level

set seed 111111

local ctarech1=0
local ctarech2=0
local ctarech3=0

foreach it of numlist 1/500 {
qui {
clear

set obs 200

gen j=_n
replace j=100 if j>100

bysort j: gen i=_n

gen v=rnormal()
gen u=rnormal()

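* copy the first student's draw of v to every student in the same school
* (-10 is a sentinel far below any plausible draw, so max() returns the first student's v)
replace v=-10 if i>1
egen aux=max(v),by(j)
replace v=aux
drop aux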

gen y=v+u

reg y,cluster(j)
local a=abs(_b[_cons]/_se[_cons])
if `a'>1.96 {
local ctarech1=`ctarech1'+1
}

gen count=1
collapse y (sum) count,by(j)

reg y [pw=count]
local a=abs(_b[_cons]/_se[_cons])
if `a'>1.96 {
local ctarech2=`ctarech2'+1
}

reg y
local a=abs(_b[_cons]/_se[_cons])
if `a'>1.96 {
local ctarech3=`ctarech3'+1
}

}
}

display "simulations=500"
display "ctarech1=`ctarech1'"
display "ctarech2=`ctarech2'"
display "ctarech3=`ctarech3'"
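
For reference, the sampling variances implied by this design can be worked out by hand. This is a minimal sketch, assuming exactly the model in the comments above (V and U independent with variance 1, 99 schools with one student and one school with 101 students). The scalar names are illustrative and not part of the do-file.

```stata
* Sketch only: variances implied by Yij = Vj + Uij with Var(V) = Var(U) = 1.
* Pooled mean over 200 students: each Vj enters n_j times, so
*   Var = [ sum_j n_j^2 + 200 ] / 200^2, with sum_j n_j^2 = 99*1 + 101^2
scalar var_pooled = (99 + 101^2 + 200) / 200^2

* Equal-weight mean of the 100 school means: Var(school mean) = 1 + 1/n_j, so
*   Var = [ 99*2 + (1 + 1/101) ] / 100^2
scalar var_equal = (99*2 + 1 + 1/101) / 100^2

display "pooled mean:          variance = " %6.4f var_pooled "  sd = " %6.4f sqrt(var_pooled)
display "mean of school means: variance = " %6.4f var_equal  "  sd = " %6.4f sqrt(var_equal)

* Share of the pooled-mean variance that comes from the large school's single draw of V
display "share from large school's V = " %6.4f 101^2 / (99 + 101^2 + 200)
```

Under these assumptions the pooled mean (the estimand of both reg y, cluster(j) and the pw-weighted regression on collapsed data) is dominated by the single draw of V for the large school, while the equal-weight estimator spreads the variance evenly across the 100 schools.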
