
# Lasso for prediction and model selection

## Highlights

• Estimators
  • Lasso
  • Square-root lasso
  • Elastic net
• Models
  • Linear
  • Logit
  • Probit
  • Poisson
• Selection methods
  • Cross-validation
  • Plugin
  • User-specified
• Postestimation
  • Cross-validation function plots
  • Coefficient path plots
  • Select different λ
  • Tables of variables as they enter and leave model
  • Measures of fit by λ
  • Compare fit across multiple lassos
• Helper commands
• Documentation
  • 350-page [LASSO] manual for lasso for prediction and lasso for inference

We are faced with more and more data, often with many variables that are poorly described or poorly understood. We can even have more variables than observations. Classical techniques break down when applied to such data.

The lasso was designed to sift through this kind of data and extract features that have the ability to predict outcomes.

Stata gives you the tools to use lasso for prediction and for characterizing the groups and patterns in your data (model selection). Use the lasso itself to select the variables that have real information about your response variable. Use split-sampling and goodness of fit to be sure the features you find generalize outside of your training (estimation) sample.

With the lasso command, you specify potential covariates, and it selects the covariates to appear in the model. The fitted model is suitable for making out-of-sample predictions but not directly applicable for statistical inference. If inference is your interest, see our description of Lasso for inference.
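
To make the selection step concrete, here is a minimal sketch in Python (standard library only) of a coordinate-descent lasso for a linear model. It illustrates the general technique, not Stata's implementation. The objective (1/(2n)) times the residual sum of squares plus λ times the sum of absolute coefficients, the function names `soft_threshold` and `lasso_cd`, and the toy data are all assumptions made for this example.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * sum((y - X b)^2) + lam * sum(|b_j|).

    X is a list of rows. No intercept and no standardization are applied
    (both are assumed to be handled by the caller).
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of x_j with the partial residual
            zj = 0.0   # sum of squares of x_j
            for i in range(n):
                pred_others = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_others)
                zj += X[i][j] ** 2
            b[j] = soft_threshold(rho / n, lam) / (zj / n) if zj > 0 else 0.0
    return b


# Toy data (hypothetical): y depends on the first column only.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.0, 4.0, 6.0, 8.0]

coefs = lasso_cd(X, y, lam=0.5)
selected = [j for j, bj in enumerate(coefs) if bj != 0.0]
```

The second coefficient is set exactly to zero rather than merely made small, which is what lets the lasso act as a variable selector. The surviving coefficient is shrunk below its unpenalized value, which is one reason penalized coefficients are not suited to direct inference.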

There are lots of lasso commands. Here are the most important ones for prediction.

You have an outcome y and variables x1-x1000. Among them might be a subset good for predicting y. Lasso attempts to find them. Type

. lasso linear y x1-x1000


To see the variables selected, type

. lassocoef


To make predictions with new data, type

. use newdata
. predict yhat


To see the fit in the new data, type

. lassogof


Lasso fits logit, probit, and Poisson models too.

. lasso logit z x1-x1000
. lasso probit z x1-x1000
. lasso poisson c x1-x1000


And it fits elastic-net models.

. elasticnet linear y x1-x1000
. elasticnet logit z x1-x1000
. elasticnet probit z x1-x1000
. elasticnet poisson c x1-x1000


Because ridge regression is a special case of elastic net, elasticnet fits ridge regressions too.
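
The relationship between lasso, ridge, and elastic net can be sketched with the penalty term alone. The Python snippet below assumes the common parameterization λ·Σ[α|β| + (1 − α)/2·β²], in which α = 1 gives the lasso penalty and α = 0 gives the ridge penalty; the function name and the example coefficients are hypothetical, and the exact scaling used by any particular software may differ.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Return lam * sum(alpha * |b| + (1 - alpha) / 2 * b**2) over coefficients b."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)


beta = [1.0, -2.0, 0.0]  # hypothetical coefficients
lasso_pen = elastic_net_penalty(beta, lam=0.1, alpha=1.0)  # pure L1 penalty
ridge_pen = elastic_net_penalty(beta, lam=0.1, alpha=0.0)  # pure L2 penalty
mixed_pen = elastic_net_penalty(beta, lam=0.1, alpha=0.5)  # a blend of both
```

Values of α strictly between 0 and 1 mix the two penalties, so the blended penalty always lies between the ridge and lasso values for the same coefficients.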

Square-root lasso is a variant of lasso for linear models.

. sqrtlasso y x1-x1000


You can force the selection of variables such as x1-x4.

. lasso linear y (x1-x4) x5-x1000


After fitting a lasso, you can use the postlasso commands.

. lassoknots                 table of estimated models by lambda
. lassocoef                  selected variables
. lassogof                   goodness of fit
. lassoselect lambda = 0.1   select model for another lambda

. coefpath                   plot coefficient path
. cvplot                     plot cross-validation function


And then there are features that will make it easier to do all the above. Need to split your data into training and testing samples? Type

. splitsample, generate(sample) nsplit(2)
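
As a rough picture of what such a split produces, the Python sketch below assigns each observation a label from 1 to `n_split` at random, with group sizes as equal as possible. The function name `split_sample` and its arguments are hypothetical, and a given seed here will not reproduce the assignment that Stata makes with the same seed.

```python
import random


def split_sample(n_obs, n_split=2, seed=None):
    """Return one label in 1..n_split per observation, randomly assigned,
    with group sizes differing by at most one."""
    rng = random.Random(seed)
    labels = [(i % n_split) + 1 for i in range(n_obs)]
    rng.shuffle(labels)
    return labels


sample = split_sample(10, n_split=2, seed=1234)
training_rows = [i for i, s in enumerate(sample) if s == 1]
testing_rows = [i for i, s in enumerate(sample) if s == 2]
```

Fixing the seed makes the split reproducible, which matters when several models are to be compared on exactly the same testing sample.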


Need to manage large variable lists? You do. We typed x1-x1000 above, but your variables will have real names, and you do not want to type them all. Use the vl commands to create lists of variables:

. vl set                       // creates vlcontinuous, vlcategorical, ...
. vl create myconts       = vlcontinuous
. vl modify myconts       = myconts - (kl srh srd polyt)
. vl create myfactors     = vlcategorical
. vl substitute myvarlist = i.myfactors myconts i.myfactors#c.myconts


The # sign creates interactions.

We just created myvarlist, which is ready for use in a lasso command such as

. lasso linear y $myvarlist

## Let's see it work

We are going to show you three examples.

1. Lasso with λ selected by cross-validation.
2. The same lasso, but we select λ to minimize the BIC.
3. The same lasso, fit by adaptive lasso.

And then we are going to compare them.

Rather than typing

. lasso linear q104 (i.gender i.q3 i.q4 i.q5) i.q2 i.q6 i.q7 i.q8 i.q9 i.q10 i.q11 i.q13 i.q14 i.q16 i.q17 i.q19 i.q25 i.q26 i.q29 i.q30 i.q32 i.q33 i.q34 i.q36 i.q37 i.q38 i.q40 i.q41 i.q42 i.q43 i.q44 i.q46 i.q47 i.q48 i.q49 i.q50 i.q51 i.q55 i.q56 i.q57 i.q58 i.q59 i.q61 i.q64 i.q65 i.q67 i.q68 i.q69 i.q71 i.q72 i.q73 i.q74 i.q75 i.q77 i.q78 i.q79 i.q82 i.q83 i.q84 i.q85 i.q86 i.q88 i.q89 i.q90 i.q91 i.q94 i.q95 i.q96 i.q97 i.q98 i.q100 i.q101 i.q102 i.q105 i.q108 i.q109 i.q110 i.q113 i.q114 i.q115 i.q116 i.q117 i.q118 i.q122 i.q123 i.q125 i.q126 i.q128 i.q130 i.q133 i.q134 i.q136 i.q137 i.q138 i.q140 i.q142 i.q143 i.q144 i.q145 i.q146 i.q147 i.q148 i.q149 i.q150 i.q151 i.q152 i.q153 i.q154 i.q155 i.q156 i.q158 i.q159 i.q160 i.q161 age q1 q15 q18 q20 q21 q22 q24 q31 q35 q45 q52 q53 q62 q63 q70 q76 q87 q93 q103 q111 q112 q120 q121 q129 q131 q132 q139 q157

we have used vl behind the scenes, so that we can type

. lasso linear q104 ($idemographics) $ifactors $vlcontinuous


And so that we can compare the out-of-sample predictions for the three models, we have already split our sample in two by typing

. splitsample, generate(sample) nsplit(2) rseed(1234)


We will fit all three models on sample==1 and later compare predictions using sample==2.

### Example 1: Lasso with λ selected by cross-validation

To fit a lasso with the default cross-validation selection method, we type

. lasso linear q104 ($idemographics) $ifactors $vlcontinuous if sample == 1

10-fold cross-validation with 100 lambdas ...
Grid value 1:     lambda = .9109571   no. of nonzero coef. =       4
Folds: 1...5....10   CVF = 16.93341
[output omitted]
Grid value 23:    lambda = .1176546   no. of nonzero coef. =      74
Folds: 1...5....10   CVF = 12.17933
... cross-validation complete ... minimum found

Lasso linear model                          No. of obs        =        458
                                            No. of covariates =        273
Selection: Cross-validation                 No. of CV folds   =         10

                                       No. of      Out-of-      CV mean
                                      nonzero       sample   prediction
   ID       Description      lambda     coef.    R-squared        error

    1      first lambda    .9109571         4       0.0147     16.93341
   18     lambda before    .1873395        42       0.2953     12.10991
 * 19   selected lambda    .1706967        49       0.2968     12.08516
   20      lambda after    .1555325        55       0.2964     12.09189
   23       last lambda    .1176546        74       0.2913     12.17933

* lambda selected by cross-validation.

Lambda (λ) is lasso's penalty parameter. Lasso fits a range of models, from models with no covariates to models with lots, corresponding to models with large λ to models with small λ.

Lasso then selected a model. Because we did not specify otherwise, it used its default, cross-validation (CV), to choose model ID=19, which has λ=0.171. The model has 49 covariates.

Cross-validation chooses the model that minimizes the cross-validation function. Here is a graph of it.

. cvplot

We plan on comparing this model with two other models, so we will store these estimates. We will store them under the name cv.

. estimates store cv

### Example 2: The same lasso, but we select λ to minimize the BIC

We can select the model corresponding to any λ we wish after fitting the lasso. Picking the λ that has the minimum Bayes information criterion (BIC) gives good predictions under certain conditions.

First, we use lassoknots to display the BIC for each model:

. lassoknots, display(nonzero osr2 bic)

                     No. of      Out-of-
                    nonzero       sample
   ID      lambda     coef.    R-squared          BIC

    1    .9109571         4       0.0147     2618.642
    2    .8300302         7       0.0236     2630.961
    3    .7562926         8       0.0421     2626.254
    4    .6891057         9       0.0635     2619.727
    5    .6278874        10       0.0857     2611.577
    6    .5721076        13       0.1110     2614.155
    8    .4749738        14       0.1581     2588.189
    9    .4327784        16       0.1785     2584.638
   10    .3943316        18       0.1980     2580.891
   11    .3593003        22       0.2170     2588.984
   12     .327381        26       0.2340     2596.792
   13    .2982974        27       0.2517     2586.521
   14    .2717975        28       0.2669     2578.211
   15    .2476517        32       0.2784     2589.632
   16     .225651        35       0.2865     2593.753
   17    .2056048        37       0.2919     2592.923
   18    .1873395        42       0.2953     2609.975
 * 19    .1706967        49       0.2968     2639.437
   20    .1555325        55       0.2964     2663.451
   21    .1417154        62       0.2952     2693.929
   22    .1291258        66       0.2934     2707.174
   23    .1176546        74       0.2913     2744.508

* lambda selected by cross-validation.

Second, to select the model with the minimum BIC, we choose ID=14, the one with 28 covariates.

. lassoselect id = 14
ID = 14  lambda = .2717975 selected

We can redraw the CV plot:

. cvplot

This graph is the same CV plot we saw earlier. We have just selected another point on the function.

We plan on comparing this model with the previous model. We will store this model under the name minBIC.

. estimates store minBIC

### Example 3: The same lasso, fit by adaptive lasso

Adaptive lasso is another selection technique that tends to select fewer covariates. It also uses cross-validation but runs multiple lassos. By default, it runs two.

To fit an adaptive lasso, we use the same command and specify the additional option selection(adaptive):

. lasso linear q104 ($idemographics) $ifactors $vlcontinuous if sample == 1, selection(adaptive)

Lasso step 1 of 2:

10-fold cross-validation with 100 lambdas ...
Grid value 1:     lambda = .9109571   no. of nonzero coef. =       4
Folds: 1...5....10   CVF = 17.012
Grid value 2:     lambda = .8300302   no. of nonzero coef. =       7
[output omitted]
Grid value 24:    lambda = .1072025   no. of nonzero coef. =      78
Folds: 1...5....10   CVF = 12.40012
... cross-validation complete ... minimum found

Lasso step 2 of 2:

Evaluating up to 100 lambdas in grid ...
Grid value 1:     lambda = 51.68486   no. of nonzero coef. =       4
[output omitted]
Grid value 100:   lambda = .0051685   no. of nonzero coef. =      59

10-fold cross-validation with 100 lambdas ...
Fold  1 of 10:  10....20....30....40....50....60....70....80....90....100
[output omitted]
Fold 10 of 10:  10....20....30....40....50....60....70....80....90....100
... cross-validation complete

Lasso linear model                          No. of obs         =        458
                                            No. of covariates  =        273
Selection: Adaptive                         No. of lasso steps =          2

                                       No. of      Out-of-      CV mean
                                      nonzero       sample   prediction
   ID       Description      lambda     coef.    R-squared        error

   25      first lambda    51.68486         4       0.0101     17.01083
   77     lambda before    .4095937        46       0.3985     10.33691
 * 78   selected lambda    .3732065        46       0.3987     10.33306
   79      lambda after    .3400519        47       0.3985     10.33653
  124       last lambda    .0051685        59       0.3677     10.86697

* lambda selected by cross-validation in final adaptive step.


Adaptive lasso selected a model with 46 covariates instead of the 49 selected by ordinary lasso.

We will store these results as adaptive.

. estimates store adaptive
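
One way to see why a second lasso step can drop covariates: in the standard adaptive lasso, each coefficient's penalty in the later step is weighted by the inverse of its absolute size from the earlier step, so weakly supported variables are penalized more heavily. The Python sketch below illustrates that weighting only. The function name, the `power` argument, and the coefficient values are hypothetical; they are not taken from the output above, and Stata's exact weighting scheme should be checked in the [LASSO] manual.

```python
def adaptive_penalty_weights(step1_coefs, power=1.0):
    """Map each variable kept in step 1 to the weight 1 / |coef| ** power.

    Variables whose step-1 coefficient is exactly zero are left out of the
    result, so they are not candidates in the next step.
    """
    return {
        name: 1.0 / abs(coef) ** power
        for name, coef in step1_coefs.items()
        if coef != 0.0
    }


# Hypothetical step-1 coefficients (illustrative values only).
step1 = {"x1": 2.0, "x2": 0.1, "x3": 0.0}
weights = adaptive_penalty_weights(step1)
```

Here the variable with the small step-1 coefficient receives a penalty weight 20 times that of the variable with the large one, making it far more likely to be dropped in the second step.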


## Comparison of results

We have three sets of results.

cv contains the model selected by CV.

minBIC contains the model selected by us that corresponds to the minimum BIC.

adaptive contains the model selected by adaptive lasso.

First, let's compare the variables each selected. The lassocoef command does this. We specify sort(coef, standardized) so that the variables with the largest absolute values of their coefficients are listed first.

. lassocoef cv minBIC adaptive, sort(coef, standardized) nofvlabel

            cv    minBIC   adaptive

0.q19       x         x          x
0.q85       x         x          x
1.q5       x         x          x
3.q156       x         x          x
0.q101       x         x          x
0.q88       x         x          x
0.q48       x         x          x
q22       x         x          x

q38
4        x         x          x

q139       x         x          x
0.q56       x         x          x
q31       x         x          x
0.q73       x         x          x
0.q96       x         x          x
1.gender       x         x          x
0.q50       x         x          x
1.q3       x         x          x
3.q16       x         x          x
2.q84       x         x          x
0.q43       x         x          x
0.q149       x         x          x
0.q159       x         x          x
3.q134       x         x          x
0.q49       x                    x
0.q115       x         x          x
0.q108       x         x          x
0.q109       x                    x
0.q140       x                    x
0.q91       x                    x

q38
3        x         x          x

q93       x                    x
0.q14       x                    x
0.q153       x                    x
0.q160       x         x          x
age       x                    x
q53       x                    x
2.q105       x
0.q102       x                    x
0.q154       x                    x
q111       x                    x
0.q142       x                    x
0.q55       x
0.q97       x

q65
4        x                    x

1.q110       x                    x
q70       x
_cons       x         x          x
0.q44                            x

Legend:
b - base level
e - empty cell
o - omitted
x - estimated


Start at the top and look down, and you will see that all three approaches selected the first 23 variables listed in the table, the variables with the largest coefficients.

Which model produces the best predictions? Let's do out-of-sample prediction to find out. We split our data into two samples at the outset for just this purpose. We fit the models on sample 1. We can compare predictions for sample 2.

The lassogof command reports goodness-of-fit statistics. We specify option postselection to compare predictions based on the postselection coefficients instead of the penalized coefficients. We specify option over(sample) so that lassogof calculates fit statistics for each sample separately.

. lassogof cv minBIC adaptive, over(sample) postselection

Postselection coefficients

Name             sample           MSE    R-squared        Obs

cv
                      1      8.652771       0.5065        503
                      2      14.58354       0.2658        493

minBIC
                      1      9.740229       0.4421        508
                      2      13.44496       0.3168        503

adaptive
                      1      8.637575       0.5057        504
                      2      14.70756       0.2595        494



We compare MSE and R-squared for sample 2. minBIC did best by both measures.
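
For readers who want to see what these two measures are, the Python sketch below computes a mean squared error and an R-squared for a set of held-out observations. It uses one common definition of R-squared (1 minus MSE divided by the variance of the outcome in the evaluation sample), which is an assumption here and may not match lassogof's exact formula; the function name and the data are hypothetical.

```python
def mse_and_r2(y, yhat):
    """Return (MSE, R-squared) for observed y and predictions yhat.

    R-squared is computed as 1 - MSE / Var(y), with the variance taken
    around the mean of y in the sample being evaluated.
    """
    n = len(y)
    mse = sum((obs - pred) ** 2 for obs, pred in zip(y, yhat)) / n
    ybar = sum(y) / n
    var_y = sum((obs - ybar) ** 2 for obs in y) / n
    return mse, 1.0 - mse / var_y


# Hypothetical held-out outcomes and predictions.
y_test = [1.0, 2.0, 3.0, 4.0]
yhat_test = [1.5, 2.0, 2.5, 4.0]
mse, r2 = mse_and_r2(y_test, yhat_test)
```

A lower MSE and a higher R-squared on the testing sample both indicate better out-of-sample prediction, which is the basis for the comparison above.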

## Tell me more

Read more about lasso for prediction in the Stata Lasso Reference Manual; see [LASSO] lasso intro.

See [D] splitsample for more about the splitsample command.

See [D] vl for more about the vl command for constructing long variable lists.

Also see Bayesian lasso.