
# Lasso with clustered data

## Highlights

• Lasso predictions with clustered data for
  • Lasso
  • Elastic net
  • Square-root lasso
• Lasso inference with clustered data for
  • Partialing-out lasso models
  • Cross-fit partialing-out lasso models
  • Double-selection lasso models
• Cluster–robust standard errors for
  • Partialing-out lasso models
  • Cross-fit partialing-out lasso models
  • Double-selection lasso models
  • Treatment-effects lasso models

You can now account for clustered data in your lasso analysis. Ignoring clustering may lead to incorrect results in the presence of correlation between observations within the same cluster. But with Stata's lasso commands—both those for prediction and those for inference—you can now obtain results that account for clustering.

With lasso commands for prediction, you simply add the cluster() option. For instance, type

. lasso linear y x1-x5000, cluster(idcode)

to account for possible correlation between observations with the same idcode during model selection. You can do this with lasso models other than linear, such as logit or Poisson, and with variable-selection methods other than lasso, such as elastic net and square-root lasso.
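To picture what accounting for clusters during cross-validation can involve, the sketch below assigns whole clusters, rather than individual observations, to folds, so that no individual is split between a training and a validation sample. It is a language-neutral illustration in Python, not Stata's implementation; the function name `assign_cluster_folds` and the toy idcode values are hypothetical.

```python
import random
from collections import defaultdict


def assign_cluster_folds(cluster_ids, n_folds=10, seed=1234):
    """Map each observation to a CV fold so that all observations
    sharing a cluster ID land in the same fold (illustrative only)."""
    rng = random.Random(seed)
    clusters = sorted(set(cluster_ids))
    rng.shuffle(clusters)  # randomize at the cluster level, not the row level
    fold_of_cluster = {c: i % n_folds for i, c in enumerate(clusters)}
    return [fold_of_cluster[c] for c in cluster_ids]


# Toy panel: 6 individuals (hypothetical idcode values), 3 observations each
idcode = [i for i in range(1, 7) for _ in range(3)]
folds = assign_cluster_folds(idcode, n_folds=3)

# Collect the set of folds each cluster was placed in
by_cluster = defaultdict(set)
for cid, fold in zip(idcode, folds):
    by_cluster[cid].add(fold)

for cid in sorted(by_cluster):
    print(cid, sorted(by_cluster[cid]))
```

Each idcode prints with exactly one fold, which is the property that keeps within-cluster correlation from leaking between the training and validation samples.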

With lasso commands for inference, you add the vce(cluster) option. For instance, type

. poregress y x1, controls(x2-x5000) vce(cluster idcode)

to produce cluster–robust standard errors that account for clustering in idcode using partialing-out lasso for linear outcomes. The vce(cluster) option is supported with all inferential lasso commands, including the new command for treatment-effects estimation with lasso.
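For intuition about what a cluster–robust standard error measures, here is a minimal Python sketch of the textbook cluster sandwich estimator for a single slope, after partialing out a constant. It is not the computation that poregress performs (there is no lasso step and no finite-sample correction); the function name `cluster_robust_slope` and the toy data are made up for illustration.

```python
import math
from collections import defaultdict


def cluster_robust_slope(y, x, cluster):
    """OLS slope of y on x (constant partialed out) and its
    cluster-robust standard error, without small-sample adjustment."""
    n = len(y)
    ybar = sum(y) / n
    xbar = sum(x) / n
    yt = [v - ybar for v in y]  # y with the constant partialed out
    xt = [v - xbar for v in x]  # x with the constant partialed out
    sxx = sum(v * v for v in xt)
    b = sum(a * c for a, c in zip(xt, yt)) / sxx
    resid = [c - b * a for a, c in zip(xt, yt)]

    # Sum the score x*e within each cluster before squaring; this is
    # what allows arbitrary correlation inside a cluster.
    score = defaultdict(float)
    for g, a, e in zip(cluster, xt, resid):
        score[g] += a * e
    var = sum(s * s for s in score.values()) / (sxx * sxx)
    return b, math.sqrt(var)


# Toy data: 3 hypothetical individuals observed twice each
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.9, 5.3, 5.8]
ids = [1, 1, 2, 2, 3, 3]

b_cl, se_cl = cluster_robust_slope(y, x, ids)
# Treating every observation as its own cluster gives the usual
# heteroskedasticity-robust standard error instead.
b_ind, se_ind = cluster_robust_slope(y, x, list(range(len(y))))
print(b_cl, se_cl, se_ind)
```

The point estimate is identical in both calls; only the standard error changes with the choice of cluster variable.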

## Let's see it work

We want to fit a linear-lasso model for the log of wages (ln_wage) using a set of variables and their second-order interactions. We have data on the individual's age, age; work experience, experience; job tenure, tenure; whether the individual lives in a rural area, rural; whether they live in the South, south; and whether they have no college education, nocollege. We want a good prediction of the log of wages given these controls.

We have repeated observations of individuals over time; we have clustered data. Each individual is identified by idcode.

We define the global macro $vars as the list of variables and $controls as the full list of control variables.

. global vars c.(age tenure experience) i.(rural south nocollege)
. global controls ($vars)##($vars)

We fit the lasso model and specify option cluster(idcode) to account for clustering, and we specify option rseed(1234) to make the results reproducible.

. lasso linear ln_wage $controls, cluster(idcode) rseed(1234)

10-fold cross-validation with 100 lambdas ...
(output omitted)
... change in the deviance stopping tolerance reached ... last lambda selected

Lasso linear model                          No. of obs        =     28,093
                                            No. of covariates =         45
Cluster: idcode                             No. of clusters   =      4,699
Selection: Cross-validation                 No. of CV folds   =         10

                                      No. of      Out-of-      CV mean
                                     nonzero       sample   prediction
   ID     Description      lambda      coef.    R-squared        error

    1    first lambda    .2261424          0       0.0010     .2526964
   81   lambda before    .0001325         23       0.3088     .1748403
 * 82 selected lambda    .0001207         23       0.3088     .1748393
   83    lambda after      .00011         23       0.3088     .1748406
   92     last lambda    .0000476         24       0.3087     .1748552

* lambda selected by cross-validation

There are 4,699 clusters. Behind the scenes, the cross-validation procedure draws random samples by idcode to arrive at the optimal lambda. We could now use the predict command to get predictions of ln_wage.

Suppose we are not interested solely in prediction. Say we want to know the effect of job tenure (tenure) on log wages (ln_wage). All the other variables are treated as potential controls, which lasso may include or exclude from the model. Lasso for inference allows us to obtain the estimate of the effect of tenure and its standard error. Because observations on the same individual are correlated over time, we would like to use cluster–robust standard errors at the idcode level.

We fit a linear model with double-selection lasso methods by using the dsregress command. First, we define the global macro $vars2 as the uninteracted control variables, and we interact them to form the complete set of controls in $controls2.

. global vars2 c.(age experience) i.(rural south nocollege)
. global controls2 ($vars2)##($vars2)

To fit the model and estimate cluster–robust standard errors, we use dsregress and specify the option vce(cluster idcode).

. dsregress ln_wage tenure, controls($controls2) vce(cluster idcode)

Estimating lasso for ln_wage using plugin
Estimating lasso for tenure using plugin

Double-selection linear model           Number of obs               =     28,093
                                        Number of controls          =         35
                                        Number of selected controls =         10
                                        Wald chi2(1)                =     195.87
                                        Prob > chi2                 =     0.0000

                             (Std. err. adjusted for 4,699 clusters in idcode)

                              Robust
     ln_wage   Coefficient  std. err.      z    P>|z|     [95% conf. interval]

      tenure     .0228961    .001636    14.00   0.000     .0196897    .0261025

Note: Chi-squared test is a wald test of the coefficients of the variables
of interest jointly equal to zero. Lassos select controls for model
estimation. Type lassoinfo to see number of selected variables in each
lasso.
Note: Lassos are performed accounting for clusters in idcode


The point estimate of about .023 means that one additional year of job tenure is associated with an increase of about .023 in the log of wages, which is roughly a 2.3% increase in wages. The standard-error estimate is robust to correlation among observations within the same cluster.
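As a quick arithmetic check, the reported z statistic, Wald chi-squared, and confidence interval can be reproduced from the coefficient and its standard error, and the log-point effect can be converted to a percentage change. The Python sketch below uses only the numbers printed in the output above; the variable names are illustrative.

```python
import math

coef = 0.0228961  # reported coefficient on tenure
se = 0.001636     # reported cluster-robust standard error

z = coef / se     # z statistic
wald = z ** 2     # Wald chi2(1) for a single coefficient is z squared

z_crit = 1.959964  # two-sided 95% critical value of the standard normal
ci_low = coef - z_crit * se
ci_high = coef + z_crit * se

# A change of b in log wages corresponds to a (exp(b) - 1) * 100 percent
# change in wages.
pct_change = (math.exp(coef) - 1) * 100

print(round(z, 2), round(wald, 2))
print(round(ci_low, 7), round(ci_high, 7))
print(round(pct_change, 2))
```

Small differences in the last digit relative to the displayed output are expected because the printed standard error is rounded.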

Stata Lasso Reference Manual

[LASSO] inference intro