Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Regression with about 5000 (dummy) variables

From	John Antonakis <[email protected]>
To	[email protected]
Subject	Re: st: Regression with about 5000 (dummy) variables
Date	Thu, 19 Apr 2012 22:30:28 +0200

Hi:

Suppose the fixed-effects are idcode and south.

clear
webuse nlswork
xtset idcode

bys idcode : egen double cl_age_id = mean(age)
bys south : egen double cl_age_south = mean(age)

reg ln_w age i.south i.idcode, cluster(idcode)

This gives:

Linear regression                                      Number of obs =   28502
                                                       F(  1,  4709) =       .
                                                       Prob > F      =       .
                                                       R-squared     =  0.6643
                                                       Root MSE      =  .30322

                              (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0181924   .0006658    27.32   0.000     .0168872    .0194977
     1.south |  -.0774963   .0195974    -3.95   0.000    -.1159164   -.0390761
             |
      idcode |
          2  |  -.3705713   .0006658  -556.59   0.000    -.3718765    -.369266

[snip]

       5159  |  -.3570145   .0207303   -17.22   0.000    -.3976556   -.3163734
       _cons |   1.561366   .0175324    89.06   0.000     1.526995    1.595738
------------------------------------------------------------------------------

Notice, we have run out of DF (with a cluster-robust vce); the overall F-test cannot be computed. Had we not used a cluster-robust vce, we would have had 4711 degrees of freedom in the numerator of the F-test:


 reg ln_w age i.south i.idcode

      Source |       SS       df       MS              Number of obs  =   28502
-------------+------------------------------           F(4711, 23790) =    9.99
       Model |  4328.36582  4711  .918778566           Prob > F       =  0.0000
    Residual |   2187.3642 23790  .091944691           R-squared      =  0.6643
-------------+------------------------------           Adj R-squared  =  0.5978
       Total |  6515.73002 28501  .228614085           Root MSE       =  .30322

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0181924   .0003475    52.35   0.000     .0175113    .0188736
     1.south |  -.0774963   .0112551    -6.89   0.000     -.099557   -.0554355
             |
      idcode |
          2  |  -.3705713   .1237911    -2.99   0.003    -.6132097   -.1279328

[snip]

       5159  |  -.3570145   .1446902    -2.47   0.014    -.6406165   -.0734125
       _cons |   1.561366   .0880103    17.74   0.000     1.388861    1.733872
------------------------------------------------------------------------------
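
To see where the degrees of freedom go in the two OLS runs above, the stored results can be printed after each fit. This is only a sketch: it assumes a Stata version that can hold roughly 4,700 indicator variables in one model (older versions may need a larger matsize), and it uses the vce(cluster ...) spelling of the cluster option. A cluster-robust variance matrix cannot have rank larger than the number of clusters, which is why the overall F-test is unavailable in the first run.

```stata
* Sketch only: prints stored results after each fit.
webuse nlswork, clear

* Cluster-robust VCE: rank of e(V) is bounded by the number of clusters
regress ln_wage age i.south i.idcode, vce(cluster idcode)
display "clusters = " e(N_clust) "   residual df = " e(df_r) "   rank of e(V) = " e(rank)

* Conventional VCE: every estimated dummy counts toward the model df
regress ln_wage age i.south i.idcode
display "model df = " e(df_m) "   residual df = " e(df_r)
```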

When we use xtreg, we get:

iis idcode
xtreg ln_w age i.south, fe cluster(idcode)

Fixed-effects (within) regression               Number of obs      =     28502
Group variable: idcode                          Number of groups   =      4710

R-sq:  within  = 0.1044                         Obs per group: min =         1
       between = 0.1233                                        avg =       6.1
       overall = 0.1062                                        max =        15

                                                F(2,4709)          =    455.05
corr(u_i, Xb)  = 0.0818                         Prob > F           =    0.0000

                              (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0181924   .0006083    29.91   0.000     .0169999     .019385
     1.south |  -.0774963   .0179053    -4.33   0.000     -.112599   -.0423935
       _cons |   1.178256   .0190444    61.87   0.000      1.14092    1.215592
-------------+----------------------------------------------------------------
     sigma_u |  .39998991
     sigma_e |  .30322383
         rho |  .63504833   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Notice, there is no F-test for the fixed-effects (usually printed at the bottom of the regression table).
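
For comparison, here is a sketch of the same fixed-effects model with the conventional (non-robust) VCE, where xtreg does print the "F test that all u_i=0" line under the table. The stored-result name e(F_f) is taken from the xtreg documentation as I recall it; treat it as an assumption and check `ereturn list` if it comes back missing.

```stata
* Sketch only: conventional VCE, so the F-test for the fixed effects is reported.
webuse nlswork, clear
xtset idcode

xtreg ln_wage age i.south, fe

* F statistic for H0: all u_i = 0 (assumed to be stored as e(F_f))
display "F for u_i = 0: " e(F_f)
```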


Now, let's run it à la Mundlak:

. xtreg ln_w age cl*, cluster(idcode)

Random-effects GLS regression                   Number of obs      =     28510
Group variable: idcode                          Number of groups   =      4710

R-sq:  within  = 0.1032                         Obs per group: min =         1
       between = 0.1271                                        avg =       6.1
       overall = 0.1133                                        max =        15

                                                Wald chi2(3)       =   1397.57
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

                              (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0182259   .0006078    29.99   0.000     .0170347    .0194171
   cl_age_id |   .0052512   .0012583     4.17   0.000      .002785    .0077174
cl_age_south |  -.2571161   .0234542   -10.96   0.000    -.3030855   -.2111467
       _cons |   8.445617   .6805643    12.41   0.000     7.111736    9.779499
-------------+----------------------------------------------------------------
     sigma_u |  .35875483
     sigma_e |  .30323734
         rho |  .58327855   (fraction of variance due to u_i)
------------------------------------------------------------------------------

The estimate (for age) is correct to three decimal places (it is a wee bit off, probably due to the unbalanced panel).
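
To put the two age coefficients side by side rather than reading them off separate tables, something like the following could be used. It is a sketch: the names fe and mundlak for the stored estimates are my own labels, and the cluster option is written as vce(cluster idcode).

```stata
* Sketch only: compare the age coefficient across the two estimators.
webuse nlswork, clear
xtset idcode

* Cluster means, as constructed at the top of this post
bysort idcode: egen double cl_age_id    = mean(age)
bysort south:  egen double cl_age_south = mean(age)

* Fixed-effects (within) estimator
xtreg ln_wage age i.south, fe vce(cluster idcode)
estimates store fe

* Random-effects estimator with the cluster means added (Mundlak)
xtreg ln_wage age cl_age_id cl_age_south, re vce(cluster idcode)
estimates store mundlak

* Show only the age row, with standard errors
estimates table fe mundlak, keep(age) b(%10.7f) se(%10.7f)
```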


With OLS à la Mundlak we have:

 reg ln_w age cl*, cluster(idcode)

Linear regression                                      Number of obs =   28510
                                                       F(  3,  4709) =  493.26
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1182
                                                       Root MSE      =  .44897

                              (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0182808   .0006088    30.03   0.000     .0170872    .0194743
   cl_age_id |   .0050963   .0013429     3.80   0.000     .0024637     .007729
cl_age_south |  -.4121572   .0236396   -17.44   0.000    -.4585019   -.3658125
       _cons |    12.9671   .6872252    18.87   0.000     11.61982    14.31439
------------------------------------------------------------------------------

The estimator still seems good. Notice, though, that the F-test numerator DFs are only 3. So that's what I meant when I said we save on DF (as compared to the OLS fixed-effects estimator).
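
My earlier message (quoted below) suggested testing the cluster means jointly with testparm as the Hausman-type check. Applied to this example, a sketch might look as follows; testing both cl_age_id and cl_age_south together is an assumption about how the check carries over to this particular specification.

```stata
* Sketch only: joint Wald test on the cluster means after the Mundlak regression.
webuse nlswork, clear
xtset idcode

bysort idcode: egen double cl_age_id    = mean(age)
bysort south:  egen double cl_age_south = mean(age)

xtreg ln_wage age cl_age_id cl_age_south, re vce(cluster idcode)

* H0: the coefficients on the cluster means are jointly zero
testparm cl_age_id cl_age_south
display "chi2 = " r(chi2) "   df = " r(df) "   p = " r(p)
```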

Best,
J.

__________________________________________

Prof. John Antonakis
Faculty of Business and Economics
Department of Organizational Behavior
University of Lausanne
Internef #618
CH-1015 Lausanne-Dorigny
Switzerland
Tel ++41 (0)21 692-3438
Fax ++41 (0)21 692-3305
http://www.hec.unil.ch/people/jantonakis

Associate Editor
The Leadership Quarterly
__________________________________________

On 19.04.2012 17:16, Austin Nichols wrote:
> John Antonakis <[email protected]>:
> The poster asked about multiple dimensions of fixed effects--how does
> the advice below relate?

> The approach shown actually adds to the size of the matrix to be inverted.
> You assert that
> "This will save you on degrees of freedom and computational requirements."
> --can you clarify that claim?
> Your
>  xtreg y x1-x4 cl_x1-cl_x4, cluster(panelvar)
> is nearly the same as
>  xtreg y x1-x4, fe robust
> right? Note that inference is not identical, as the RE estimator
> does not "know" the means are estimated.
>

> On Thu, Apr 19, 2012 at 10:57 AM, John Antonakis <[email protected]> wrote:

>> Hi:
>>
>> Let me let you in on a trick that is relatively unknown.
>>

>> One way around the problem of a huge amount of dummy variables is to use the
>> Mundlak procedure:
>>
>> Mundlak, Y. (1978). Pooling of Time-Series and Cross-Section Data.
>> Econometrica, 46(1), 69-85.
>>
>> ....for an intuitive explanation, see:
>>

>> Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making
>> causal claims: A review and recommendations. The Leadership Quarterly,
>> 21(6), 1086-1120. http://www.hec.unil.ch/jantonakis/Causal_Claims.pdf
>>
>> Basically, for each time varying independent variable (x1-x4), take the
>> cluster mean and include that in the regression.  That is, do:
>>
>> foreach var of varlist x1-x4 {
>> bys panelvar: egen cl_`var'=mean(`var')
>> }
>>
>> Then, run your regression like this:
>>
>> xtreg y x1-x4 cl_x1-cl_x4, cluster(panelvar)
>>
>> The Hausman test for fixed- versus random-effects is:
>>
>> testparm cl_x1-cl_x4
>>
>> This will save you on degrees of freedom and computational requirements.
>> This estimator is consistent. Try it out with a subsample of your dataset
>> to see. Many econometricians have been amazed by this.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/


References:
- st: Regression with about 5000 (dummy) variables
  - From: Suryadipta Roy <[email protected]>
- Re: st: Regression with about 5000 (dummy) variables
  - From: John Antonakis <[email protected]>
- Re: st: Regression with about 5000 (dummy) variables
  - From: Austin Nichols <[email protected]>
