This page describes the features that were new in Stata 17. Please see our Stata 18 page for the features that are new in Stata 18.

Lasso with clustered data

Highlights

  • Lasso predictions with clustered data for

    • Lasso
    • Elastic net
    • Square-root lasso
  • Lasso inference with clustered data for

    • Partialing-out lasso models
    • Cross-fit partialing-out lasso models
    • Double-selection lasso models
  • Cluster–robust standard errors for

    • Partialing-out lasso models
    • Cross-fit partialing-out lasso models
    • Double-selection lasso models
    • Treatment-effects lasso models

You can now account for clustered data in your lasso analysis. Ignoring clustering may lead to incorrect results in the presence of correlation between observations within the same cluster. But with Stata's lasso commands—both those for prediction and those for inference—you can now obtain results that account for clustering.

With lasso commands for prediction, you simply add the cluster() option. For instance, type

. lasso linear y x1-x5000, cluster(idcode)

to account for possible correlation between observations with the same idcode during model selection. You can do this with lasso models other than linear, such as logit or Poisson, and with variable-selection methods other than lasso, such as elastic net and square-root lasso.

With lasso commands for inference, you add the vce(cluster) option. For instance, type

. poregress y x1, controls(x2-x5000) vce(cluster idcode)

to produce cluster–robust standard errors that account for clustering in idcode using partialing-out lasso for linear outcomes. The vce(cluster) option is supported with all inferential lasso commands, including the new command for treatment-effects estimation with lasso.


Let's see it work

We want to fit a linear-lasso model for the log of wages (ln_wage) using a set of variables and their second-order interactions. We have data on the individual's age, age; work experience, experience; job tenure, tenure; whether the individual lives in a rural area, rural; if they live in the South, south; and if they have no college education, nocollege. We want a good prediction of the log of wages given these controls.

We have repeated observations of individuals over time; we have clustered data. Each individual is identified by idcode.

We define the global macro $vars as the list of base variables and $controls as the full list of control variables, formed by interacting the base variables with themselves.

. global vars c.(age tenure experience) i.(rural south nocollege)
. global controls ($vars)##($vars)

We fit the lasso model and specify option cluster(idcode) to account for clustering, and we specify option rseed(1234) to make the results reproducible.

. lasso linear ln_wage $controls, cluster(idcode) rseed(1234)

10-fold cross-validation with 100 lambdas ...
(output omitted)
... change in the deviance stopping tolerance reached ... last lambda selected


Lasso linear model                          No. of obs        =    28,093
                                            No. of covariates =        45
Cluster: idcode                             No. of clusters   =     4,699
Selection: Cross-validation                 No. of CV folds   =        10

         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+-------------------------------------------------------------------
       1 |    first lambda    .2261424         0       0.0010     .2526964
      81 |   lambda before    .0001325        23       0.3088     .1748403
    * 82 | selected lambda    .0001207        23       0.3088     .1748393
      83 |    lambda after      .00011        23       0.3088     .1748406
      92 |     last lambda    .0000476        24       0.3087     .1748552
-----------------------------------------------------------------------------
* lambda selected by cross-validation.

There are 4,699 clusters. Behind the scenes, the cross-validation procedure splits the sample into folds by idcode, keeping all observations from the same cluster in the same fold, and then searches over the lambda grid for the optimal lambda.
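The key property of cluster cross-validation is that a whole cluster is assigned to a single fold, so correlated observations never straddle the training and validation samples. Here is a minimal Python sketch of that fold assignment; cluster_folds is a hypothetical helper for illustration, not Stata's internal routine:

```python
import random

def cluster_folds(cluster_ids, k=10, seed=1234):
    """Assign each cluster (not each observation) to one of k CV folds,
    so all observations sharing a cluster id land in the same fold."""
    clusters = sorted(set(cluster_ids))     # unique cluster ids
    rng = random.Random(seed)               # seeded for reproducibility
    rng.shuffle(clusters)                   # randomize cluster order
    fold_of = {c: i % k for i, c in enumerate(clusters)}
    return [fold_of[c] for c in cluster_ids]

# repeated observations per individual, as in the wage data
ids = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5]
folds = cluster_folds(ids, k=2)
```

Splitting by observation instead would let one individual's records appear in both the training and validation folds, making the out-of-sample prediction error look better than it is.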

We could now use the predict command to get predictions of ln_wage.

Suppose we are not interested solely in prediction. Say we want to know the effect of job tenure (tenure) on log wages (ln_wage). All the other variables are treated as potential controls, which lasso may include in or exclude from the model. Lasso for inference allows us to obtain an estimate of the effect of tenure and its standard error. Because observations on the same individual are correlated over time, we would like cluster–robust standard errors at the idcode level. We fit a linear model with the double-selection lasso method by using the dsregress command.
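The double-selection idea can be sketched compactly: run one lasso of the outcome on the controls and another of the variable of interest on the controls, then refit by least squares using the union of the selected controls. The following Python sketch (lasso_cd and double_selection are hypothetical helpers using a simple coordinate-descent lasso on synthetic data) illustrates the procedure; it is not Stata's dsregress implementation:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent for
    (1/2n)||y - Xb||^2 + lam*||b||_1 (illustrative only)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

def double_selection(y, d, X, lam=0.1):
    """Double selection: union of controls selected in the y-lasso and
    the d-lasso, then OLS of y on d plus that union."""
    s1 = np.flatnonzero(lasso_cd(X, y - y.mean(), lam))   # controls for y
    s2 = np.flatnonzero(lasso_cd(X, d - d.mean(), lam))   # controls for d
    keep = np.union1d(s1, s2).astype(int)
    Z = np.column_stack([np.ones(len(y)), d, X[:, keep]])
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    return coef[1]                                        # effect of d

# synthetic data: true effect of d on y is 0.5; X[:, 0] confounds
rng = np.random.default_rng(1234)
n, p = 2000, 20
X = rng.standard_normal((n, p))
d = X[:, 0] + rng.standard_normal(n)
y = 0.5 * d + X[:, 0] + X[:, 1] + rng.standard_normal(n)
est = double_selection(y, d, X)
```

Selecting controls from both lassos, rather than from the outcome lasso alone, is what protects the final estimate against omitted-variable bias from controls that predict the treatment but only weakly predict the outcome.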

First, we define the global macro $vars2 as the uninteracted control variables, and we interact them to form the complete set of controls in $controls2.

. global vars2 c.(age experience) i.(rural south nocollege)
. global controls2 ($vars2)##($vars2)

To fit the model and estimate cluster–robust standard errors, we use dsregress and specify the option vce(cluster idcode).

. dsregress ln_wage tenure, controls($controls2) vce(cluster idcode)

Estimating lasso for ln_wage using plugin
Estimating lasso for tenure using plugin

Double-selection linear model           Number of obs               =      28,093
                                        Number of controls          =          35
                                        Number of selected controls =          10
                                        Wald chi2(1)                =      195.87
                                        Prob > chi2                 =      0.0000

                                (Std. err. adjusted for 4,699 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      tenure |   .0228961    .001636    14.00   0.000     .0196897    .0261025
------------------------------------------------------------------------------
Note: Chi-squared test is a Wald test of the coefficients of the variables
      of interest jointly equal to zero. Lassos select controls for model
      estimation. Type lassoinfo to see number of selected variables in
      each lasso.
Note: Lassos are performed accounting for clusters in idcode.

The point estimate of .023 means that one additional year of job tenure increases log wages by about .023, roughly a 2.3% increase in wages. The standard-error estimate is robust to correlation among observations within the same cluster.
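Because the outcome is in logs, the coefficient approximates a proportional effect; the exact percentage change implied by the estimate can be computed directly (a quick back-of-the-envelope check, not part of the Stata output):

```python
import math

b = 0.0228961               # tenure coefficient from the dsregress output
pct = math.exp(b) - 1       # exact proportional change in wages
print(round(100 * pct, 2))  # about 2.32 percent per year of tenure
```

For coefficients this small, exp(b) - 1 and b itself are nearly identical, which is why reading the coefficient directly as a percentage is a reasonable shortcut.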


Additional resources

Learn more about Stata's lasso features.

Stata Lasso Reference Manual

[LASSO] inference intro