
st: RE: When number of regressors greater than the number of clusters in OLS regression

From   "Schaffer, Mark E" <[email protected]>
To   <[email protected]>
Subject   st: RE: When number of regressors greater than the number of clusters in OLS regression
Date   Mon, 1 Sep 2008 17:57:07 +0100


> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Divya Balasubramaniam
> Sent: Monday, September 01, 2008 5:35 PM
> To: [email protected]
> Subject: st: When number of regressors greater than the 
> number of clusters in OLS regression
> Dear Dr. Schaffer,
> I am using clustering in my analysis and I am having some 
> trouble understanding some of the important issues. I have 
> read several papers you have written on clustering issues and 
> hence I am emailing you to seek help. 
> I am doing a district level analysis for the census year 
> 2001. I have 436 districts in total coming from 17 States. I 
> run an OLS regression of the share of households having tap water 
> access on several control variables (I have about 25 
> regressors).  I use the Stata command areg Y X, 
> absorb(state) cluster(state). I have state fixed effects 
> and cluster by state. 
> My question is: I have more regressors (25) than the number of 
> clusters (17). I also find in the Stata output that the 
> F-stat is missing. I would like to seek your advice on whether I 
> can make inference by looking at the individual coefficient 
> estimates and the reported robust standard errors. I did see 
> your comment on this issue on the Stata listserv. However, I 
> could not find answers as to how to fix this problem of 
> having more regressors than the number of clusters.

I have done a bit of work on this with Austin Nichols.  Austin's
presentation at the 2007 UK Stata User Group meeting is available here:

Your question comes up on Statalist from time to time, e.g.,

Vince Wiggins' posting to Statalist is the most informative one I can
think of:

The short answer, as I understand it, is that having #regressors >
#clusters is not in itself a problem.  The problems are, instead:

1.  The cluster-robust VCV is consistent as the number of clusters goes
to infinity, so what matters is the number of clusters rather than the
number of observations.  You'd like a large number of clusters so that
you can be confident that the asymptotics are kicking in.  17 clusters
is not very far on the way to infinity, so the performance of the
cluster-robust VCV in your application could be poor.

2.  The rank of the cluster-robust VCV is at most the number of
clusters.  This means you can't jointly test more hypotheses than you
have clusters, which is why the F-stat for all 25 of your regressors
is reported as missing.  More generally, testing multiple hypotheses is
going to eat up degrees of freedom, and you have very little to spare
here (only 17 to start with).
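
To make point 2 concrete, here is a minimal sketch in Python rather
than Stata. It is not taken from the presentation or postings mentioned
above. It assumes the plain sandwich form
V = (X'X)^-1 [sum over clusters g of X_g'u_g u_g'X_g] (X'X)^-1 with no
finite-sample correction, and the data, the sizes (3 clusters, 5
regressors) and all function names are made up for illustration. The
middle term is a sum of one rank-1 matrix per cluster, so its rank
cannot exceed the number of clusters, and V inherits that rank.

```python
from fractions import Fraction  # exact arithmetic, so ranks are not blurred by rounding


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def rank(m):
    """Rank of a matrix of Fractions by Gaussian elimination."""
    m = [row[:] for row in m]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def inverse(a):
    """Inverse of a nonsingular square matrix by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        pivot = next(i for i in range(c, n) if aug[i][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        pv = aug[c][c]
        aug[c] = [v / pv for v in aug[c]]
        for i in range(n):
            if i != c:
                f = aug[i][c]
                aug[i] = [x - f * z for x, z in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]


# Hypothetical data: G clusters, k regressors (more regressors than clusters).
G, k, per_cluster = 3, 5, 4
t_values = range(1, G * per_cluster + 1)
X = [[Fraction(t) ** j for j in range(k)] for t in t_values]  # 1, t, ..., t^4
y = [[Fraction(t) ** 5] for t in t_values]
cluster = [(t - 1) // per_cluster for t in t_values]

# OLS coefficients and residuals.
XtX_inv = inverse(matmul(transpose(X), X))
beta = matmul(XtX_inv, matmul(transpose(X), y))
u = [yi[0] - fi[0] for yi, fi in zip(y, matmul(X, beta))]

# One k-vector X_g'u_g per cluster; the "meat" is the sum of their outer products.
scores = [[sum(X[i][j] * u[i] for i in range(len(u)) if cluster[i] == g)
           for j in range(k)] for g in range(G)]
meat = [[sum(s[a] * s[b] for s in scores) for b in range(k)] for a in range(k)]
vcv = matmul(matmul(XtX_inv, meat), XtX_inv)

meat_rank = rank(meat)
vcv_rank = rank(vcv)
print("regressors:", k, "clusters:", G, "rank of VCV:", vcv_rank)
```

The VCV here is 5 x 5 but its rank is capped by the 3 clusters, so a
joint test of all 5 coefficients has no invertible VCV to work with,
while each individual standard error can still be computed. (With OLS
residuals the per-cluster vectors sum to zero, so the rank is in fact
at most one less than the number of clusters.)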

Others on the list may also want to comment on this.


NB: General comment to Statalisters - I couldn't find a Stata FAQ on
this.  Did I miss it?  If not, should there be one?

> I will be extremely thankful if you can kindly help me in this regard.
> Sincerely,
> Divya.
> =======================================
> Divya Balasubramaniam
> Economics PhD Student
> Terry College of Business
> University of Georgia
> Athens -30602.

