- Estimators
  - Lasso
  - Square-root lasso
  - Elastic net

- Models
  - Linear
  - Logit
  - Probit
  - Poisson

- Selection methods
  - Cross-validation
  - Adaptive lasso
  - Plugin
  - BIC
  - User-specified

- Lasso with clustered data
- Postestimation
  - Cross-validation function plots
  - Coefficient path plots
  - Select different λ
  - Tables of variables as they enter and leave model
  - Measures of fit by λ
  - Compare fit across multiple lassos

- Helper commands
  - Split data randomly
  - Manage large numbers of variables

- Documentation
  - 350-page [LASSO] manual for lasso for prediction and lasso for inference

We are faced with more and more data, often with many poorly described or poorly understood variables. We can even have more variables than observations. Classical techniques break down when applied to such data.

The lasso is designed to sift through this kind of data and extract features that have the ability to predict outcomes.

Stata gives you the tools to use lasso for prediction and for characterizing the groups and patterns in your data (model selection). Use the lasso itself to select the variables that have real information about your response variable. Use split-sampling and goodness of fit to be sure the features you find generalize outside of your training (estimation) sample.

With the **lasso** command, you specify potential covariates,
and it selects the covariates to appear in the model. The fitted
model is suitable for making out-of-sample predictions but not
directly applicable for statistical inference. If inference
is your interest, see our description of Lasso for inference.

There are lots of lasso commands. Here are the most important ones for prediction.

You have an outcome **y** and potential covariates **x1**-**x1000**. To fit a lasso, type

.lasso linear y x1-x1000

To see the variables selected, type

.lassocoef

To make predictions with new data, type

.use newdata

.predict yhat
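After **lasso**, **predict** uses the penalized coefficients unless you ask otherwise; the **postselection** option uses the unpenalized coefficients of the selected variables instead. A minimal sketch, assuming a lasso has just been fit and that newdata contains the same covariates (the variable names yhat_pen and yhat_post are made up for illustration):

```stata
* Assumes lasso results are in memory; newdata is the hypothetical dataset named above
use newdata, clear

* Predictions from the penalized (shrunken) coefficients, the default
predict yhat_pen

* Predictions from the postselection (unpenalized) coefficients
predict yhat_post, postselection
```

Comparing the two sets of predictions shows how much the shrinkage matters for your data.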

To see the fit in the new data, type

.lassogof

Lasso fits logit, probit, and Poisson models too.

.lasso logit z x1-x1000

.lasso probit z x1-x1000

.lasso poisson c x1-x1000

And it fits elastic-net models.

.elasticnet linear y x1-x1000

.elasticnet logit z x1-x1000

.elasticnet probit z x1-x1000

.elasticnet poisson c x1-x1000

Because ridge regression is a special case of elastic net, it fits ridge regressions too.
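Elastic net mixes the lasso and ridge penalties through a second parameter, α, with α = 1 giving lasso and α = 0 giving ridge. A sketch of how that might look with the **alphas()** option, using the placeholder variables from above (the particular grid and seed are arbitrary choices for illustration):

```stata
* Elastic net: cross-validate over a grid of alpha values as well as lambda
elasticnet linear y x1-x1000, alphas(0.25 0.5 0.75) rseed(1234)

* Ridge regression: restrict the grid to alpha = 0
elasticnet linear y x1-x1000, alphas(0) rseed(1234)
```

Because ridge shrinks coefficients without setting them to zero, expect the second command to keep all covariates in the model.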

Square-root lasso is a variant of lasso for linear models.

.sqrtlasso y x1-x1000

You can force the selection of variables such as **x1**-**x4** by enclosing them in parentheses.

.lasso linear y (x1-x4) x5-x1000

After fitting a lasso, you can use the postlasso commands.

| Command | Description |
|---|---|
| .lassoknots | table of estimated models by lambda |
| .lassocoef | selected variables |
| .lassogof | goodness of fit |
| .lassoselect lambda = 0.1 | select model for another lambda |
| .coefpath | plot coefficient path |
| .cvplot | plot cross-validation function |
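Put together, a typical postestimation session might look like the sketch below. It assumes the placeholder variables **y** and **x1**-**x1000** from above; the seed and the λ value are illustrative only, and the λ you choose should come from your own **lassoknots** listing.

```stata
* Fit a lasso; rseed() makes the cross-validation folds reproducible
lasso linear y x1-x1000, rseed(1234)

* Tabulate the knots, adding out-of-sample R-squared and BIC for each
lassoknots, display(nonzero osr2 bic)

* Plot the cross-validation function and the coefficient paths
cvplot
coefpath

* Switch to the model at another lambda, then list its selected variables
lassoselect lambda = 0.1
lassocoef
```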

And then there are features that will make it easier to do all the above. Need to split your data into training and testing samples? Type

.splitsample, generate(sample) nsplit(2)
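The split need not be even. A short sketch, assuming you want roughly 75% of the observations for training and 25% for testing (the proportions and the seed are arbitrary choices for illustration):

```stata
* 75% training, 25% testing; rseed() makes the split reproducible
splitsample, generate(sample) split(0.75 0.25) rseed(1234)

* Check the resulting group sizes
tabulate sample
```

Setting a seed matters here: without it, rerunning the split assigns observations to different samples and the later comparisons cannot be reproduced.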

Need to manage large variable lists? You do. We typed **x1-x1000** above,
but your variables will have real names, and you do not want to type them all.
Use the **vl** commands to create lists of variables:

.vl set // creates vlcontinuous, vlcategorical, ...

.vl create myconts = vlcontinuous

.vl modify myconts = myconts - (kl srh srd polyt)

.vl create myfactors = vlcategorical

.vl substitute myvarlist = i.myfactors myconts i.myfactors#c.myconts

The **#** sign creates interactions.

We just created **myvarlist**, which is ready for use in a lasso
command such as

.lasso linear y $myvarlist

We are going to show you three examples.

- Lasso with λ selected by cross-validation.
- The same lasso, but we select λ to minimize the BIC.
- The same lasso, fit by adaptive lasso.

And then we are going to compare them.

Rather than typing

.lasso linear q104 (i.gender i.q3 i.q4 i.q5) i.q2 i.q6 i.q7 i.q8 i.q9 i.q10 i.q11 i.q13 i.q14 i.q16 i.q17 i.q19 i.q25 i.q26 i.q29 i.q30 i.q32 i.q33 i.q34 i.q36 i.q37 i.q38 i.q40 i.q41 i.q42 i.q43 i.q44 i.q46 i.q47 i.q48 i.q49 i.q50 i.q51 i.q55 i.q56 i.q57 i.q58 i.q59 i.q61 i.q64 i.q65 i.q67 i.q68 i.q69 i.q71 i.q72 i.q73 i.q74 i.q75 i.q77 i.q78 i.q79 i.q82 i.q83 i.q84 i.q85 i.q86 i.q88 i.q89 i.q90 i.q91 i.q94 i.q95 i.q96 i.q97 i.q98 i.q100 i.q101 i.q102 i.q105 i.q108 i.q109 i.q110 i.q113 i.q114 i.q115 i.q116 i.q117 i.q118 i.q122 i.q123 i.q125 i.q126 i.q128 i.q130 i.q133 i.q134 i.q136 i.q137 i.q138 i.q140 i.q142 i.q143 i.q144 i.q145 i.q146 i.q147 i.q148 i.q149 i.q150 i.q151 i.q152 i.q153 i.q154 i.q155 i.q156 i.q158 i.q159 i.q160 i.q161 age q1 q15 q18 q20 q21 q22 q24 q31 q35 q45 q52 q53 q62 q63 q70 q76 q87 q93 q103 q111 q112 q120 q121 q129 q131 q132 q139 q157

we have used **vl** behind the scenes, so that we can type

.lasso linear q104 ($idemographics) $ifactors $vlcontinuous

And so that we can compare the out-of-sample predictions for the three models, we have already split our sample in two by typing

.splitsample, generate(sample) nsplit(2) rseed(1234)

We will fit all three models on **sample==1** and later compare
predictions using **sample==2**.

To fit a lasso with the default cross-validation selection method, we type

.lasso linear q104 ($idemographics) $ifactors $vlcontinuous if sample == 1

10-fold cross-validation with 100 lambdas ...

Grid value 1: lambda = .9109571 no. of nonzero coef. = 4

Folds: 1...5....10 CVF = 16.93341

[output omitted]

Grid value 23: lambda = .1176546 no. of nonzero coef. = 74

Folds: 1...5....10 CVF = 12.17933

... cross-validation complete ... minimum found

Lasso linear model

No. of obs = 458

No. of covariates = 273

Selection: Cross-validation

No. of CV folds = 10

| ID | Description | lambda | No. of nonzero coef. | Out-of-sample R-squared | CV mean prediction error |
|---|---|---|---|---|---|
| 1 | first lambda | .9109571 | 4 | 0.0147 | 16.93341 |
| 18 | lambda before | .1873395 | 42 | 0.2953 | 12.10991 |
| * 19 | selected lambda | .1706967 | 49 | 0.2968 | 12.08516 |
| 20 | lambda after | .1555325 | 55 | 0.2964 | 12.09189 |
| 23 | last lambda | .1176546 | 74 | 0.2913 | 12.17933 |

Lambda (λ) is lasso's penalty parameter. Lasso fits a range of models, from models with no covariates to models with lots, corresponding to models with large λ to models with small λ.
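To make the role of λ concrete, here is one common way to write the lasso problem for a linear model. This is a textbook form, not a statement of Stata's exact implementation: the symbols N (observations), p (potential covariates), and the 1/(2N) scaling are assumptions for illustration, and software may scale or weight the penalty differently.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta_0,\,\beta}
  \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - \mathbf{x}_i \beta \right)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
```

When λ is large, the penalty dominates and every coefficient is driven to zero; as λ shrinks, more coefficients become nonzero, which is why the number of nonzero coefficients grows down the table above.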

Lasso then selected a model. Because we did not specify otherwise, it used its default, cross-validation (CV), to choose model ID=19, which has λ=0.171. The model has 49 covariates.

Cross-validation chooses the model that minimizes the cross-validation function. Here is a graph of it.

.cvplot

We plan on comparing this model with two other models, so we will
store these estimates. We will store them under the name **cv**.

.estimates store cv

We can select the model corresponding to any λ we wish after fitting the lasso. Picking the λ that has the minimum Bayes information criterion (BIC) gives good predictions under certain conditions.
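For reference, the BIC is usually defined as below. This is the standard textbook form; treating the degrees of freedom as the number of nonzero coefficients at each λ is an assumption made here for illustration, and the exact conventions used are described in the [LASSO] manual.

```latex
\mathrm{BIC}(\lambda) = -2 \log L(\lambda) + \mathit{df}(\lambda)\,\log N
```

The first term rewards fit and the second charges for each coefficient kept, so the minimum-BIC model tends to be smaller than the one chosen by cross-validation.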

To fit a lasso with minimum BIC, we use the same command and specify the additional option **selection(bic)**:

.lasso linear q104 ($idemographics) $ifactors $vlcontinuous if sample == 1, selection(bic)

Evaluating up to 100 lambdas in grid ...

Grid value 1: lambda = .9109571 no. of nonzero coef. = 4 BIC = 2618.642

Grid value 2: lambda = .8300302 no. of nonzero coef. = 7 BIC = 2630.961

Grid value 3: lambda = .7562926 no. of nonzero coef. = 8 BIC = 2626.254

Grid value 4: lambda = .6891057 no. of nonzero coef. = 9 BIC = 2619.727

Grid value 5: lambda = .6278874 no. of nonzero coef. = 10 BIC = 2611.577

Grid value 6: lambda = .5721076 no. of nonzero coef. = 13 BIC = 2614.155

Grid value 7: lambda = .5212832 no. of nonzero coef. = 13 BIC = 2597.164

Grid value 8: lambda = .4749738 no. of nonzero coef. = 14 BIC = 2588.189

Grid value 9: lambda = .4327784 no. of nonzero coef. = 16 BIC = 2584.638

Grid value 10: lambda = .3943316 no. of nonzero coef. = 18 BIC = 2580.891

Grid value 11: lambda = .3593003 no. of nonzero coef. = 22 BIC = 2588.984

Grid value 12: lambda = .327381 no. of nonzero coef. = 26 BIC = 2596.792

Grid value 13: lambda = .2982974 no. of nonzero coef. = 27 BIC = 2586.521

Grid value 14: lambda = .2717975 no. of nonzero coef. = 28 BIC = 2578.211

Grid value 15: lambda = .2476517 no. of nonzero coef. = 32 BIC = 2589.632

Grid value 16: lambda = .225651 no. of nonzero coef. = 35 BIC = 2593.753

Grid value 17: lambda = .2056048 no. of nonzero coef. = 37 BIC = 2592.923

Grid value 18: lambda = .1873395 no. of nonzero coef. = 42 BIC = 2609.975

Grid value 19: lambda = .1706967 no. of nonzero coef. = 49 BIC = 2639.437

... selection BIC complete ... minimum found

| ID | Description | lambda | No. of nonzero coef. | In-sample R-squared | BIC |
|---|---|---|---|---|---|
| 1 | first lambda | .9109571 | 4 | 0.0308 | 2618.642 |
| 13 | lambda before | .2982974 | 27 | 0.3357 | 2586.521 |
| * 14 | selected lambda | .2717975 | 28 | 0.3563 | 2578.211 |
| 15 | lambda after | .2476517 | 32 | 0.3745 | 2589.632 |
| 19 | last lambda | .1706967 | 49 | 0.4445 | 2639.437 |

We can draw the BIC function plot:

.bicplot

We will store these results as **minBIC**.

.estimates store minBIC

Adaptive lasso is another selection technique that tends to select fewer covariates. It also uses cross-validation but runs multiple lassos. By default, it runs two.
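The idea behind the second step can be sketched as a lasso whose penalty is weighted by the first-step estimates. The notation below is an illustrative assumption, not Stata's documented formula: the superscript (1) marks first-step coefficients, and the power δ is taken to be 1 unless set otherwise.

```latex
\hat{\beta}^{(2)} = \arg\min_{\beta_0,\,\beta}
  \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - \mathbf{x}_i \beta \right)^2
  + \lambda \sum_{j=1}^{p} \omega_j \lvert \beta_j \rvert \right\},
\qquad
\omega_j = \frac{1}{\lvert \hat{\beta}^{(1)}_j \rvert^{\delta}}
```

Covariates with large first-step coefficients receive small weights and are penalized lightly, while covariates whose first-step coefficient is zero are effectively excluded, which is why the method tends to select fewer covariates.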

To fit an adaptive lasso, we use the same command and specify the
additional option **selection(adaptive)**:

.lasso linear q104 ($idemographics) $ifactors $vlcontinuous if sample == 1, selection(adaptive)

Lasso step 1 of 2:

10-fold cross-validation with 100 lambdas ...

Grid value 1: lambda = .9109571 no. of nonzero coef. = 4

Folds: 1...5....10 CVF = 17.012

Grid value 2: lambda = .8300302 no. of nonzero coef. = 7

[output omitted]

Grid value 24: lambda = .1072025 no. of nonzero coef. = 78

Folds: 1...5....10 CVF = 12.40012

... cross-validation complete ... minimum found

Lasso step 2 of 2:

Evaluating up to 100 lambdas in grid ...

Grid value 1: lambda = 51.68486 no. of nonzero coef. = 4

[output omitted]

Grid value 100: lambda = .0051685 no. of nonzero coef. = 59

10-fold cross-validation with 100 lambdas ...

Fold 1 of 10: 10....20....30....40....50....60....70....80....90....100

[output omitted]

Fold 10 of 10: 10....20....30....40....50....60....70....80....90....100

... cross-validation complete

Lasso linear model

No. of obs = 458

No. of covariates = 273

Selection: Adaptive

No. of lasso steps = 2

Final adaptive step results

| ID | Description | lambda | No. of nonzero coef. | Out-of-sample R-squared | CV mean prediction error |
|---|---|---|---|---|---|
| 25 | first lambda | 51.68486 | 4 | 0.0101 | 17.01083 |
| 77 | lambda before | .4095937 | 46 | 0.3985 | 10.33691 |
| * 78 | selected lambda | .3732065 | 46 | 0.3987 | 10.33306 |
| 79 | lambda after | .3400519 | 47 | 0.3985 | 10.33653 |
| 124 | last lambda | .0051685 | 59 | 0.3677 | 10.86697 |

Adaptive lasso selected a model with 46 covariates instead of the 49 selected by ordinary lasso.

We will store these results as **adaptive**.

.estimates store adaptive

We have three sets of results.

**cv** contains the model selected by CV.

**minBIC** contains the model that minimizes the BIC.

**adaptive** contains the model selected by adaptive lasso.

First, let's compare the variables each selected. The
**lassocoef** command does this. We specify **sort(coef,
standardized)** so that the variables with the largest absolute
values of their coefficients are listed first.

.lassocoef cv minBIC adaptive, sort(coef, standardized) nofvlabel

| Variable | Selected (cv, minBIC, adaptive) |
|---|---|
| 0.q19 | x x x |
| 0.q85 | x x x |
| 1.q5 | x x x |
| 3.q156 | x x x |
| 0.q101 | x x x |
| 0.q88 | x x x |
| 0.q48 | x x x |
| q22 | x x x |
| 4.q38 | x x x |
| q139 | x x x |
| 0.q56 | x x x |
| q31 | x x x |
| 0.q73 | x x x |
| 0.q96 | x x x |
| 1.gender | x x x |
| 0.q50 | x x x |
| 1.q3 | x x x |
| 3.q16 | x x x |
| 2.q84 | x x x |
| 0.q43 | x x x |
| 0.q149 | x x x |
| 0.q159 | x x x |
| 3.q134 | x x x |
| 0.q49 | x x |
| 0.q115 | x x x |
| 0.q108 | x x x |
| 0.q109 | x x |
| 0.q140 | x x |
| 0.q91 | x x |
| 3.q38 | x x x |
| q93 | x x |
| 0.q14 | x x |
| 0.q153 | x x |
| 0.q160 | x x x |
| age | x x |
| q53 | x x |
| 2.q105 | x |
| 0.q102 | x x |
| 0.q154 | x x |
| q111 | x x |
| 0.q142 | x x |
| 0.q55 | x |
| 0.q97 | x |
| 4.q65 | x x |
| 1.q110 | x x |
| q70 | x |
| _cons | x x x |
| 0.q44 | x |

Start at the top and look down, and you will see that all three approaches selected the first 23 variables listed in the table, the variables with the largest coefficients.

Which model produces the best predictions? Let's do out-of-sample prediction to find out. We split our data into two samples at the outset for just this purpose. We fit the models on sample 1. We can compare predictions for sample 2.

The **lassogof** command reports goodness-of-fit statistics. We specify
option **postselection** to compare predictions based on the postselection
coefficients instead of the penalized coefficients. We specify option
**over(sample)** so that **lassogof** calculates fit statistics
for each sample separately.

.lassogof cv minBIC adaptive, over(sample) postselection

Penalized coefficients

| Name | sample | MSE | R-squared | Obs |
|---|---|---|---|---|
| cv | 1 | 8.652771 | 0.5065 | 503 |
| | 2 | 14.58354 | 0.2658 | 493 |
| minBIC | 1 | 9.740229 | 0.4421 | 508 |
| | 2 | 13.44496 | 0.3168 | 503 |
| adaptive | 1 | 8.637575 | 0.5057 | 504 |
| | 2 | 14.70756 | 0.2595 | 494 |

We compare MSE and *R*-squared for sample 2. **minBIC**
did best by both measures, with the lowest MSE (13.44) and the highest *R*-squared (0.3168).
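To see what the out-of-sample MSE measures, you could compute it by hand for one of the stored models. A rough sketch, assuming the stored results **minBIC** and the **sample** variable from above (the names yhat_bic and sqerr are made up for illustration); the mean reported by **summarize** should correspond to the sample-2 MSE, although the observations used may differ if predictions are missing for some of them.

```stata
* Restore the stored minimum-BIC results
estimates restore minBIC

* Postselection predictions for the holdout sample only
predict double yhat_bic if sample == 2, postselection

* Squared prediction error and its mean over sample 2
generate double sqerr = (q104 - yhat_bic)^2 if sample == 2
summarize sqerr
```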

Learn more about Stata's lasso features.

Read more about lasso for prediction in the *Stata Lasso Reference Manual*; see [LASSO] lasso intro.

See [D] splitsample for more about the **splitsample** command.

See [D] vl for more about the **vl** command for constructing
long variable lists.

Also see Bayesian lasso.