 Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: Regression with about 5000 (dummy) variables

 From: John Antonakis
 To: statalist@hsphsun2.harvard.edu
 Subject: Re: st: Regression with about 5000 (dummy) variables
 Date: Thu, 19 Apr 2012 22:30:28 +0200

Hi:

Suppose the fixed effects are idcode and south.

```
clear
webuse nlswork
xtset idcode

bys idcode : egen double cl_age_id = mean(age)
bys south : egen double cl_age_south = mean(age)

reg ln_w age i.south i.idcode, cluster(idcode)
```

This gives:

```
Linear regression                                      Number of obs =   28502
                                                       F(  1,  4709) =       .
                                                       Prob > F      =       .
                                                       R-squared     =  0.6643
                                                       Root MSE      =  .30322

                             (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0181924   .0006658    27.32   0.000     .0168872    .0194977
     1.south |  -.0774963   .0195974    -3.95   0.000    -.1159164   -.0390761
             |
      idcode |
          2  |  -.3705713   .0006658  -556.59   0.000    -.3718765    -.369266
[snip]
       5159  |  -.3570145   .0207303   -17.22   0.000    -.3976556   -.3163734
             |
       _cons |   1.561366   .0175324    89.06   0.000     1.526995    1.595738
------------------------------------------------------------------------------
```

Notice that we have run out of DF with the cluster-robust vce, so the overall F-test cannot be computed. Had we not used a cluster-robust vce, we would have had 4711 degrees of freedom in the numerator of the F-test:
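
The degrees-of-freedom counts in this post can be checked by hand. A minimal sketch that uses Stata's display as a calculator; every number is taken from the regression output in this post, and the grouping of terms in the comments is an interpretation of that output rather than something stated in the original message.

```stata
* Model terms besides the constant: age, 1.south, and 4710 - 1 idcode dummies
* (displays 4711, the numerator df of the F-test without clustering)
display 1 + 1 + (4710 - 1)

* Residual df without clustering: observations minus model terms minus the constant
* (displays 23790)
display 28502 - 4711 - 1

* With cluster(idcode), the reported denominator df is clusters minus one
* (displays 4709, which is fewer than the 4711 terms a joint test would cover)
display 4710 - 1
```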
```
reg ln_w age i.south i.idcode
```

```
      Source |       SS       df       MS              Number of obs =   28502
-------------+------------------------------           F(4711, 23790) =    9.99
       Model |  4328.36582  4711  .918778566           Prob > F      =  0.0000
    Residual |   2187.3642 23790  .091944691           R-squared     =  0.6643
-------------+------------------------------           Adj R-squared =  0.5978
       Total |  6515.73002 28501  .228614085           Root MSE      =  .30322

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0181924   .0003475    52.35   0.000     .0175113    .0188736
     1.south |  -.0774963   .0112551    -6.89   0.000     -.099557   -.0554355
             |
      idcode |
          2  |  -.3705713   .1237911    -2.99   0.003    -.6132097   -.1279328
[snip]
       5159  |  -.3570145   .1446902    -2.47   0.014    -.6406165   -.0734125
             |
       _cons |   1.561366   .0880103    17.74   0.000     1.388861    1.733872
------------------------------------------------------------------------------
```

When we use xtreg, we get:

```
iis idcode
xtreg ln_w age i.south, fe cluster(idcode)
```

```
Fixed-effects (within) regression               Number of obs      =     28502
Group variable: idcode                          Number of groups   =      4710

R-sq:  within  = 0.1044                         Obs per group: min =         1
       between = 0.1233                                        avg =       6.1
       overall = 0.1062                                        max =        15

                                                F(2,4709)          =    455.05
corr(u_i, Xb)  = 0.0818                         Prob > F           =    0.0000

                             (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0181924   .0006083    29.91   0.000     .0169999     .019385
     1.south |  -.0774963   .0179053    -4.33   0.000     -.112599   -.0423935
       _cons |   1.178256   .0190444    61.87   0.000      1.14092    1.215592
-------------+----------------------------------------------------------------
     sigma_u |  .39998991
     sigma_e |  .30322383
         rho |  .63504833   (fraction of variance due to u_i)
------------------------------------------------------------------------------
```

Notice that there is no F-test for the fixed effects (it is usually printed at the bottom of the regression table).
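
For comparison, a minimal sketch of the same fixed-effects model with the default (non-robust) VCE. With that VCE, xtreg, fe prints an "F test that all u_i=0" line beneath the coefficient table; the line is left out when a robust or cluster-robust VCE is requested. No output is reproduced here because this run is not part of the original post.

```stata
* Same fixed-effects model as above, but with the default (non-robust) VCE
webuse nlswork, clear
xtset idcode
xtreg ln_w age i.south, fe
* Look for the line "F test that all u_i=0" beneath the coefficient table
```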
Now, let's run it à la Mundlak:

```
. xtreg ln_w age cl*, cluster(idcode)
```

```
Random-effects GLS regression                   Number of obs      =     28510
Group variable: idcode                          Number of groups   =      4710

R-sq:  within  = 0.1032                         Obs per group: min =         1
       between = 0.1271                                        avg =       6.1
       overall = 0.1133                                        max =        15

                                                Wald chi2(3)       =   1397.57
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

                             (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0182259   .0006078    29.99   0.000     .0170347    .0194171
   cl_age_id |   .0052512   .0012583     4.17   0.000      .002785    .0077174
cl_age_south |  -.2571161   .0234542   -10.96   0.000    -.3030855   -.2111467
       _cons |   8.445617   .6805643    12.41   0.000     7.111736    9.779499
-------------+----------------------------------------------------------------
     sigma_u |  .35875483
     sigma_e |  .30323734
         rho |  .58327855   (fraction of variance due to u_i)
------------------------------------------------------------------------------
```

The estimate for age matches the fixed-effects estimate to three decimal places (.0182259 versus .0181924); it is a wee bit off, probably because the panel is unbalanced.
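
One way to put the two age estimates side by side is to store both fits and tabulate them. This is a sketch only: the stored-result names fe_fit and mundlak_fit are made up for illustration, and the cluster means are rebuilt exactly as at the top of the post. Note that the outputs in this post report slightly different samples for the two fits (28502 versus 28510 observations), so the comparison is not on identical data.

```stata
* Rebuild the cluster means, then fit both models and store the results
webuse nlswork, clear
xtset idcode
bysort idcode: egen double cl_age_id = mean(age)
bysort south: egen double cl_age_south = mean(age)

* Fixed-effects fit (hypothetical stored name: fe_fit)
quietly xtreg ln_w age i.south, fe cluster(idcode)
estimates store fe_fit

* Mundlak-style random-effects fit (hypothetical stored name: mundlak_fit)
quietly xtreg ln_w age cl_age_id cl_age_south, cluster(idcode)
estimates store mundlak_fit

* Show the age coefficient and its standard error from each model
estimates table fe_fit mundlak_fit, keep(age) b(%9.7f) se(%9.7f)
```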
With OLS à la Mundlak we have:

```
reg ln_w age cl*, cluster(idcode)
```

```
Linear regression                                      Number of obs =   28510
                                                       F(  3,  4709) =  493.26
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1182
                                                       Root MSE      =  .44897

                             (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0182808   .0006088    30.03   0.000     .0170872    .0194743
   cl_age_id |   .0050963   .0013429     3.80   0.000     .0024637     .007729
cl_age_south |  -.4121572   .0236396   -17.44   0.000    -.4585019   -.3658125
       _cons |    12.9671   .6872252    18.87   0.000     11.61982    14.31439
------------------------------------------------------------------------------
```

The estimator still seems good. Notice, though, that the F-test has only 3 numerator DF here, versus 4711 for the OLS fixed-effects regression without clustering. That's what I meant when I said we save on DF (as compared to the OLS fixed-effects estimator).
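
The earlier message quoted further down in this thread gives testparm on the cluster means as the Hausman-type check of fixed versus random effects. A sketch of that step for this example, assuming the two mean variables created at the top of the post; no result is shown because the original post does not report one.

```stata
* Rebuild the cluster means used in the Mundlak regressions
webuse nlswork, clear
xtset idcode
bysort idcode: egen double cl_age_id = mean(age)
bysort south: egen double cl_age_south = mean(age)

* Random-effects model with the cluster means added
xtreg ln_w age cl_age_id cl_age_south, cluster(idcode)

* Joint test that the coefficients on the cluster means are zero
testparm cl_age_id cl_age_south
```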
```
Best,
J.

__________________________________________

Prof. John Antonakis
Department of Organizational Behavior
University of Lausanne
Internef #618
CH-1015 Lausanne-Dorigny
Switzerland
Tel ++41 (0)21 692-3438
Fax ++41 (0)21 692-3305
http://www.hec.unil.ch/people/jantonakis

Associate Editor
__________________________________________
```

```
On 19.04.2012 17:16, Austin Nichols wrote:
> John Antonakis <John.Antonakis@unil.ch>:
> The approach shown actually adds to the size of the matrix to be inverted.
> You assert that
> "This will save you on degrees of freedom and computational requirements."
> --can you clarify that claim?
> Your
>  xtreg y x1-x4 cl_x1-cl_x4, cluster(panelvar)
> is nearly the same as
>  xtreg y x1-x4, fe robust
> right? Note that inference is not identical, as the RE estimator
> does not "know" the means are estimated.
>
> On Thu, Apr 19, 2012 at 10:57 AM, John Antonakis <John.Antonakis@unil.ch> wrote:
>> Hi:
>>
>> Let me let you in on a trick that is relatively unknown.
>>
>> One way around the problem of a huge amount of dummy variables is to use the
>> Mundlak procedure:
>>
>> Mundlak, Y. (1978). Pooling of Time-Series and Cross-Section Data.
>> Econometrica, 46(1), 69-85.
>>
>> ....for an intuitive explanation, see:
>>
>> Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making
>> causal claims: A review and recommendations. The Leadership Quarterly,
>> 21(6). 1086-1120. http://www.hec.unil.ch/jantonakis/Causal_Claims.pdf
>>
>> Basically, for each time varying independent variable (x1-x4), take the
>> cluster mean and include that in the regression.  That is, do:
>>
>> foreach var of varlist x1-x4 {
>> bys panelvar: egen cl_`var'=mean(`var')
>> }
>>
>> Then, run your regression like this:
>>
>> xtreg y x1-x4 cl_x1-cl_x4, cluster(panelvar)
>>
>> The Hausman test for fixed- versus random-effects is:
>>
>> testparm cl_x1-cl_x4
>>
>> This will save you on degrees of freedom and computational requirements.
>> This estimator is consistent. Try it out with a subsample of your dataset
>> to see. Many econometricians have been amazed by this.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

```