High-dimensional fixed effects (HDFE)

Highlights

Absorb multiple high-dimensional categorical variables in

linear models with areg, absorb()
fixed-effects linear models with xtreg, fe absorb()

Choose an alternating projection algorithm:

Halperin
Cimmino

Gain speed by using option absorb()

Absorb not just one but multiple high-dimensional categorical variables in your linear and fixed-effects linear models with option absorb() of commands areg and xtreg. Enjoy remarkable speed gains over the traditional approach, which includes indicators for the categories of these variables in your models. Choose between different estimation methods. These features are part of StataNow™.

Let's see it work

-> Linear models with high-dimensional categorical variables

-> How much time are we saving?

-> More speed in fixed-effects linear models

Linear models with high-dimensional categorical variables

We often include categorical variables in our models as controls. These controls are necessary for model specification, but they are not the focus of our analysis. For instance, we may want to study the effect of import tariffs (imports) on yearly trade volume and include year, country, and industry as controls.

We could fit a linear regression with indicator variables for the three controls:

. regress trade imports i.year i.country i.industry

If we have 40 years of data, 160 countries, and 4-digit industry codes, we would be estimating roughly 1,200 parameters. This is time-consuming, even though only one of those parameters, the coefficient on imports, is of interest to our research question.

In StataNow, we can fit the same model in a fraction of the time by typing

. areg trade imports, absorb(year country industry)

Variables year, country, and industry are absorbed: the model controls for them, but their coefficients are not reported. areg could already absorb one categorical variable; now it can absorb as many as we want.
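To check that absorbing the controls leaves the coefficient of interest unchanged, here is a minimal sketch. The trade data above are not available for download, so it uses the auto dataset that ships with Stata as a stand-in; the variables price, mpg, rep78, and foreign belong to that dataset and are not part of the trade example. The second command assumes a StataNow installation in which absorb() accepts more than one variable.

```stata
* Stand-in data (assumption): the auto dataset shipped with Stata
sysuse auto, clear

* Traditional approach: indicator variables for both categorical controls
quietly regress price mpg i.rep78 i.foreign
scalar b_reg = _b[mpg]

* Absorb both controls instead (multiple variables in absorb() need StataNow)
quietly areg price mpg, absorb(rep78 foreign)
scalar b_areg = _b[mpg]

* The two coefficients on mpg should agree up to numerical tolerance
display "regress: " b_reg "   areg: " b_areg
display "relative difference: " reldif(b_reg, b_areg)
```

Only the coefficient is compared here. With so few categories there is no speed advantage to measure; the point is only that the two specifications describe the same model.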

If we instead want a fixed-effects model with, say, industry as the panel variable, we declare the panel structure with xtset and absorb the remaining controls:

. xtset industry year

. xtreg trade imports, fe absorb(year country)

How much time are we saving?

Below is a toy example with one million observations, one variable of interest (x), two categorical variables (a1 and a2) each with one thousand categories, and the id variable with one hundred thousand categories.

Previously, we could absorb only one variable and would have typed

. webuse hdfe
. quietly areg y x i.a1 i.a2, absorb(id)

specifying the variable with the largest number of categories, id, in the absorb() option. This takes about five minutes in Stata/MP and six minutes in Stata/SE.

Now we can absorb all three categorical variables by typing

. quietly areg y x, absorb(id a1 a2)

and this takes roughly 1 second in Stata/MP and 1.3 seconds in Stata/SE. (The times may vary slightly across different computers, but the time gains will be similar.)
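To reproduce the comparison on your own machine, the two commands can be wrapped in Stata's timer command. This is a sketch that reuses the hdfe dataset and the two specifications above; the timer numbers 1 and 2 are arbitrary labels. The first fit is the slow one, so expect it to run for several minutes.

```stata
* Load the example dataset used above
webuse hdfe, clear

timer clear

* Timer 1: absorb only id, with a1 and a2 as indicator variables
timer on 1
quietly areg y x i.a1 i.a2, absorb(id)
timer off 1

* Timer 2: absorb all three categorical variables (requires StataNow)
timer on 2
quietly areg y x, absorb(id a1 a2)
timer off 2

* Report elapsed seconds for both timers
timer list
```

timer list also leaves the elapsed times in r(t1) and r(t2), which is convenient if you want to compute the ratio yourself.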

The time gains are remarkable!
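One way to see where the speed comes from is the standard partialling-out argument, sketched here with notation introduced only for illustration (\(D_j\), \(M_j\), and \(M\) do not appear in the Stata documentation). Write the model with \(k\) absorbed categorical variables, each represented by a matrix of indicators \(D_j\):

\[
y = X\beta + D_1\alpha_1 + \cdots + D_k\alpha_k + \varepsilon .
\]

By the Frisch-Waugh-Lovell theorem, the estimate of \(\beta\) can be obtained without estimating any \(\alpha_j\):

\[
\hat\beta = (\tilde X'\tilde X)^{-1}\tilde X'\tilde y, \qquad \tilde X = MX, \quad \tilde y = My,
\]

where \(M\) projects onto the orthogonal complement of the column space of \([D_1 \; \cdots \; D_k]\). For a single categorical variable, \(M_j = I - D_j(D_j'D_j)^{-1}D_j'\) simply subtracts group means, which is cheap. For several variables, \(M\) can be approximated by repeating those cheap steps, either in sequence or averaged:

\[
Mv = \lim_{n\to\infty} (M_k \cdots M_2 M_1)^n v
\qquad\text{or}\qquad
Mv = \lim_{n\to\infty} \Bigl(\tfrac{1}{k}\textstyle\sum_{j=1}^{k} M_j\Bigr)^{n} v .
\]

The sequential form is the kind of iteration usually associated with Halperin's theorem and the averaged form with Cimmino's method; how closely areg's two algorithm choices follow these textbook forms is an assumption here, not something this page states.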

More speed in fixed-effects linear models

areg is the fastest command for models with high-dimensional categorical variables. But if you want to fit a fixed-effects model, xtreg, fe may be more appropriate.

Previously, to control for categorical variables with xtreg, fe, you had to specify them as indicator variables in the model. Now you can specify them in the new absorb() option, just as you do with areg; this will make xtreg, fe run much faster.

Continuing with the previous example, suppose you want to fit a linear model with id fixed effects. You would type

. xtset id
. xtreg y x, fe absorb(a1 a2) vce(cluster id)

and get

(header output omitted)



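------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |  -.5017659   .0010547  -475.73   0.000    -.5038331   -.4996986
       _cons |   1.509526   .0031635   477.16   0.000     1.503326    1.515727
-------------+----------------------------------------------------------------
     sigma_u |  1.4160503
     sigma_e |  3.1659943
         rho |  .16670092   (fraction of variance due to u_i)
------------------------------------------------------------------------------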

xtreg, fe is slower than areg because it does more heavy lifting. In particular, it computes panel-level statistics that are used, for instance, to compute the standard deviation of the panel effects (sigma_u). If you are not interested in sigma_u, you can save execution time by specifying the nosigmau option.

. xtreg y x, fe absorb(a1 a2) vce(cluster id) nosigmau
(header output omitted)



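------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |  -.5017659   .0010547  -475.73   0.000    -.5038331   -.4996986
       _cons |   1.509526   .0031635   477.16   0.000     1.503326    1.515727
-------------+----------------------------------------------------------------
     sigma_e |  3.1659943
------------------------------------------------------------------------------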

Without the vce(cluster id) option, xtreg, fe reports a test that all panel effects, the \(u_{i}\)'s, are zero. In this case, specifying the nouitest option will suppress both the test and the estimation of sigma_u to save even more execution time.
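To see how much nosigmau and nouitest save on your own machine, the three variants discussed above can be timed side by side. This is a sketch using only the commands and options already shown; the timer numbers are arbitrary labels, and no particular timings are claimed.

```stata
* Load the example dataset and declare id as the panel variable
webuse hdfe, clear
xtset id

timer clear

* Timer 1: default, computes sigma_u
timer on 1
quietly xtreg y x, fe absorb(a1 a2) vce(cluster id)
timer off 1

* Timer 2: skip sigma_u
timer on 2
quietly xtreg y x, fe absorb(a1 a2) vce(cluster id) nosigmau
timer off 2

* Timer 3: no clustering; suppress the test of the panel effects and sigma_u
timer on 3
quietly xtreg y x, fe absorb(a1 a2) nouitest
timer off 3

timer list
```

Note that the third fit drops vce(cluster id), so its standard errors are not comparable with those of the first two; it is included only to time the nouitest option.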

Tell me more

Read more about how to handle high-dimensional categorical predictors in linear models in [R] areg and in fixed-effects linear models in [XT] xtreg.

View all the new features in Stata 18 and, in particular, New in linear models.
