This page describes the features that were new in Stata 17. Please see our Stata 18 page for the features that are new in Stata 18.

Lasso with clustered data

Highlights

  • Lasso predictions with clustered data for

    • Lasso
    • Elastic net
    • Square-root lasso
  • Lasso inference with clustered data for

    • Partialing-out lasso models
    • Cross-fit partialing-out lasso models
    • Double-selection lasso models
  • Cluster–robust standard errors for

    • Partialing-out lasso models
    • Cross-fit partialing-out lasso models
    • Double-selection lasso models
    • Treatment-effects lasso models

You can now account for clustered data in your lasso analysis. Ignoring clustering may lead to incorrect results in the presence of correlation between observations within the same cluster. But with Stata's lasso commands—both those for prediction and those for inference—you can now obtain results that account for clustering.

With lasso commands for prediction, you simply add the cluster() option. For instance, type

. lasso linear y x1-x5000, cluster(idcode)

to account for possible correlation between observations with the same idcode during model selection. You can do this with lasso models other than linear, such as logit or Poisson, and with variable-selection methods other than lasso, such as elastic net and square-root lasso.

With lasso commands for inference, you add the vce(cluster) option. For instance, type

. poregress y x1, controls(x2-x5000) vce(cluster idcode)

to produce cluster–robust standard errors that account for clustering in idcode using partialing-out lasso for linear outcomes. The vce(cluster) option is supported with all inferential lasso commands, including the new command for treatment-effects estimation with lasso.


Let's see it work

We want to fit a linear-lasso model for the log of wages (ln_wage) using a set of variables and their second-order interactions. We have data on the individual's age, age; work experience, experience; job tenure, tenure; whether the individual lives in a rural area, rural; if they live in the South, south; and if they have no college education, nocollege. We want a good prediction of the log of wages given these controls.

We have repeated observations of individuals over time; we have clustered data. Each individual is identified by idcode.

We define the global macro $vars as the list of base variables and $controls as the full list of control variables, formed by interacting the base variables with themselves.

. global vars c.(age tenure experience) i.(rural south nocollege)
. global controls ($vars)##($vars)

We fit the lasso model and specify option cluster(idcode) to account for clustering, and we specify option rseed(1234) to make the results reproducible.

. lasso linear ln_wage $controls, cluster(idcode) rseed(1234)

10-fold cross-validation with 100 lambdas ...
(output omitted)
... change in the deviance stopping tolerance reached ... last lambda selected


Lasso linear model                          No. of obs        =    28,093
                                            No. of covariates =        45
Cluster: idcode                             No. of clusters   =     4,699
Selection: Cross-validation                 No. of CV folds   =        10

         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+-------------------------------------------------------------------
       1 |    first lambda    .2261424         0       0.0010     .2526964
      81 |   lambda before    .0001325        23       0.3088     .1748403
    * 82 | selected lambda    .0001207        23       0.3088     .1748393
      83 |    lambda after      .00011        23       0.3088     .1748406
      92 |     last lambda    .0000476        24       0.3087     .1748552
-----------------------------------------------------------------------------
* lambda selected by cross-validation.

There are 4,699 clusters. Behind the scenes, the cross-validation procedure splits the sample into folds by idcode, keeping all observations from the same cluster in the same fold, and then searches over the lambda grid for the optimal lambda.
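The key property of cluster cross-validation is that a whole cluster is assigned to a single fold, so correlated observations never straddle the training and validation samples. Here is a minimal Python sketch of that fold assignment; cluster_folds is a hypothetical helper for illustration, not Stata's internal routine:

```python
import random

def cluster_folds(cluster_ids, k=10, seed=1234):
    """Assign each cluster (not each observation) to one of k CV folds,
    so all observations sharing a cluster id land in the same fold."""
    clusters = sorted(set(cluster_ids))     # unique cluster ids
    rng = random.Random(seed)               # seeded for reproducibility
    rng.shuffle(clusters)                   # randomize cluster order
    fold_of = {c: i % k for i, c in enumerate(clusters)}
    return [fold_of[c] for c in cluster_ids]

# repeated observations per individual, as in the wage data
ids = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5]
folds = cluster_folds(ids, k=2)
```

Splitting by observation instead would let one individual's records appear in both the training and validation folds, making the out-of-sample prediction error look better than it is.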

We could now use the predict command to get predictions of ln_wage.

Suppose we are not interested solely in prediction. Say we want to know the effect of job tenure (tenure) on log wages (ln_wage). All the other variables are treated as potential controls, which lasso may include in or exclude from the model. Lasso for inference allows us to obtain an estimate of the effect of tenure and its standard error. Because observations on the same individual are correlated over time, we would like cluster–robust standard errors at the idcode level. We fit a linear model with the double-selection lasso method by using the dsregress command.
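The double-selection idea can be sketched compactly: run one lasso of the outcome on the controls and another of the variable of interest on the controls, then refit by least squares using the union of the selected controls. The following Python sketch (lasso_cd and double_selection are hypothetical helpers using a simple coordinate-descent lasso on synthetic data) illustrates the procedure; it is not Stata's dsregress implementation:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent for
    (1/2n)||y - Xb||^2 + lam*||b||_1 (illustrative only)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

def double_selection(y, d, X, lam=0.1):
    """Double selection: union of controls selected in the y-lasso and
    the d-lasso, then OLS of y on d plus that union."""
    s1 = np.flatnonzero(lasso_cd(X, y - y.mean(), lam))   # controls for y
    s2 = np.flatnonzero(lasso_cd(X, d - d.mean(), lam))   # controls for d
    keep = np.union1d(s1, s2).astype(int)
    Z = np.column_stack([np.ones(len(y)), d, X[:, keep]])
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    return coef[1]                                        # effect of d

# synthetic data: true effect of d on y is 0.5; X[:, 0] confounds
rng = np.random.default_rng(1234)
n, p = 2000, 20
X = rng.standard_normal((n, p))
d = X[:, 0] + rng.standard_normal(n)
y = 0.5 * d + X[:, 0] + X[:, 1] + rng.standard_normal(n)
est = double_selection(y, d, X)
```

Selecting controls from both lassos, rather than from the outcome lasso alone, is what protects the final estimate against omitted-variable bias from controls that predict the treatment but only weakly predict the outcome.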

First, we define the global macro $vars2 as the uninteracted control variables, and we interact them to form the complete set of controls in $controls2.

. global vars2 c.(age experience) i.(rural south nocollege)
. global controls2 ($vars2)##($vars2)

To fit the model and estimate cluster–robust standard errors, we use dsregress and specify the option vce(cluster idcode).

. dsregress ln_wage tenure, controls($controls2) vce(cluster idcode)

Estimating lasso for ln_wage using plugin
Estimating lasso for tenure using plugin

Double-selection linear model           Number of obs               =      28,093
                                        Number of controls          =          35
                                        Number of selected controls =          10
                                        Wald chi2(1)                =      195.87
                                        Prob > chi2                 =      0.0000

                                (Std. err. adjusted for 4,699 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      tenure |   .0228961    .001636    14.00   0.000     .0196897    .0261025
------------------------------------------------------------------------------
Note: Chi-squared test is a Wald test of the coefficients of the variables
      of interest jointly equal to zero. Lassos select controls for model
      estimation. Type lassoinfo to see number of selected variables in
      each lasso.
Note: Lassos are performed accounting for clusters in idcode.

The point estimate of .023 means that one additional year of job tenure increases log wages by about .023, roughly a 2.3% increase in wages. The standard-error estimate is robust to correlation among observations within the same cluster.
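Because the outcome is in logs, the coefficient approximates a proportional effect; the exact percentage change implied by the estimate can be computed directly (a quick back-of-the-envelope check, not part of the Stata output):

```python
import math

b = 0.0228961               # tenure coefficient from the dsregress output
pct = math.exp(b) - 1       # exact proportional change in wages
print(round(100 * pct, 2))  # about 2.32 percent per year of tenure
```

For coefficients this small, exp(b) - 1 and b itself are nearly identical, which is why reading the coefficient directly as a percentage is a reasonable shortcut.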


Additional resources

Learn more about Stata's lasso features.

Stata Lasso Reference Manual

[LASSO] inference intro