Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | John Antonakis <John.Antonakis@unil.ch> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Regression with about 5000 (dummy) variables |
Date | Thu, 19 Apr 2012 16:57:27 +0200 |
Hi: Let me let you in on a trick that is relatively unknown. One way around the problem of a very large number of dummy variables is to use the Mundlak procedure:
Mundlak, Y. (1978). Pooling of Time-Series and Cross-Section Data. Econometrica, 46(1), 69-85.
...for an intuitive explanation, see: Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6), 1086-1120. http://www.hec.unil.ch/jantonakis/Causal_Claims.pdf
Basically, for each time-varying independent variable (x1-x4), take the cluster mean (the mean within each panel unit) and include that in the regression. That is, do:
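In equation form, the idea can be sketched as follows. The notation is my own shorthand, not taken from the post: i indexes panel units, t indexes time, u_i is the unit effect, and T_i is the number of observations for unit i.

```latex
y_{it} = \beta_0 + x_{it}'\beta + \bar{x}_i'\gamma + u_i + e_{it},
\qquad
\bar{x}_i = \frac{1}{T_i}\sum_{t=1}^{T_i} x_{it}
```

The cluster means \bar{x}_i absorb the part of the unit effect that is correlated with the regressors, so the model can be estimated by random effects. A joint test of \gamma = 0 then plays the role of the Hausman test.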
    foreach var of varlist x1-x4 {
        bys panelvar: egen cl_`var' = mean(`var')
    }

Then, run your regression like this:

    xtreg y x1-x4 cl_x1-cl_x4, cluster(panelvar)

The Hausman test for fixed- versus random-effects is:

    testparm cl_x1-cl_x4

This will save you on degrees of freedom and computational requirements. This estimator is consistent. Try it out with a subsample of your dataset to see. Many econometricians have been amazed by this.
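If you want to try this before touching your own data, here is a minimal sketch on a simulated balanced panel. Everything in it is made up for illustration: the variable names id, t and u, the sample size, and the data-generating process are assumptions, not part of the original question. With a balanced panel, the coefficient on x1 in the second regression should coincide with the fixed-effects coefficient from the first.

```stata
* Simulated panel: 200 units observed for 5 periods (illustrative only)
clear
set seed 12345
set obs 200
gen id = _n
* Unit effect, constant within each id
gen u = rnormal()
expand 5
bysort id: gen t = _n
* Regressor that is correlated with the unit effect
gen x1 = rnormal() + 0.5*u
gen y = 1 + 2*x1 + u + rnormal()
xtset id t

* Fixed-effects benchmark
xtreg y x1, fe vce(cluster id)

* Mundlak version: random effects plus the cluster mean of x1
bysort id: egen cl_x1 = mean(x1)
xtreg y x1 cl_x1, re vce(cluster id)

* Test of the cluster mean (fixed versus random effects)
testparm cl_x1
```

Because x1 is built to be correlated with the unit effect, the test on cl_x1 would be expected to reject here, which points toward the fixed-effects interpretation.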
HTH,
J.

__________________________________________

Prof. John Antonakis
Faculty of Business and Economics
Department of Organizational Behavior
University of Lausanne
Internef #618
CH-1015 Lausanne-Dorigny
Switzerland
Tel ++41 (0)21 692-3438
Fax ++41 (0)21 692-3305
http://www.hec.unil.ch/people/jantonakis

Associate Editor
The Leadership Quarterly
__________________________________________

On 19.04.2012 16:39, Suryadipta Roy wrote:
> Dear Statalisters,
>
> I am trying to run a fixed effects panel regression which has more
> than 4000 dummies (based on theory in the gravity model literature in
> international economics), and hence close to 5000 variables in the
> regression. The coefficients of the dummy variables are not of any
> interest. The code is as follows: xtreg y x1 x2...... imp_time_*
> exp_time_*, fe cluster(panelvar), where panelvar has been set using
> -xtset-, and imp_time and exp_time are importer-time and exporter-time
> fixed effects respectively. However, the regression had run close to 2
> hours without generating any result, at which point I stopped it using
> -Break-. I had set the memory to 5000m, and the matsize to 5000 using
> -set-.
>
> My Stata specification is Stata/SE 11.2 for Windows (64-bit x86-64).
> My PC specification: Processor- Intel Core i5-2430M CPU @ 2.40GHz;
> RAM- 8 GB, in a 64-bit OS.
>
> I would have greatly appreciated some help to find out if this is
> normal for Stata to take this much time (or more) in the presence of a
> large number of variables, and if there is a way to accomplish the
> task faster. The gravity literature has suggested a couple of ways to
> do this without the dummy variable approach, but I was trying to find
> out if there is a better way to do it if I persist with the dummy
> variables. Any help is greatly appreciated.
>
> Best regards,
> Suryadipta.
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/