Lasso for prediction and model selection

Highlights

Estimators

Lasso
Square-root lasso
Elastic net

Models

Linear
Logit
Probit
Poisson
Cox

Selection methods

Cross-validation
Adaptive lasso
Plugin
BIC
User-specified

Lasso with clustered data

Postestimation

Cross-validation function plots
Coefficient path plots
Select different λ
Tables of variables as they enter and leave model
Measures of fit by λ
Compare fit across multiple lassos

Helper commands

Split data randomly
Manage large numbers of variables

Documentation

370-page [LASSO] manual for lasso for prediction and lasso for inference

We are faced with more and more data, often with many variables that are poorly described or poorly understood. We can even have more variables than observations. Classical techniques break down when applied to such data.

The lasso is designed to sift through this kind of data and extract features that have the ability to predict outcomes.

Stata gives you the tools to use lasso for prediction and for characterizing the groups and patterns in your data (model selection). Use the lasso itself to select the variables that have real information about your response variable. Use split-sampling and goodness of fit to be sure the features you find generalize outside of your training (estimation) sample.

With the lasso command, you specify potential covariates, and it selects the covariates to appear in the model. The fitted model is suitable for making out-of-sample predictions but not directly applicable for statistical inference. If inference is your interest, see our description of Lasso for inference.

There are lots of lasso commands. Here are the most important ones for prediction.

You have an outcome y and variables x1-x1000. Among them might be a subset good for predicting y. Lasso attempts to find them. Type

. lasso linear y x1-x1000

To see the variables selected, type

. lassocoef

To make predictions with new data, type

. use newdata
. predict yhat

To see the fit in the new data, type

. lassogof

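Lasso fits logit, probit, Poisson, and Cox proportional hazards models too. For Cox models, no outcome variable is listed because the survival outcome is declared beforehand with stset.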

. lasso logit z x1-x1000
. lasso probit z x1-x1000
. lasso poisson c x1-x1000
. lasso cox x1-x1000

And it fits elastic-net models.

. elasticnet linear y x1-x1000
. elasticnet logit z x1-x1000
. elasticnet probit z x1-x1000
. elasticnet poisson c x1-x1000
. elasticnet cox x1-x1000

Because ridge regression is a special case of elastic net, elasticnet fits ridge regressions too.
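
To see why ridge regression falls out as a special case, it helps to write the elastic-net penalty in its usual textbook form. The parameterization below, with mixing parameter α and coefficients β_j, is an assumption made for illustration; Stata's exact scaling and any per-coefficient weights are described in the [LASSO] manual.

```latex
% Textbook elastic-net penalty (assumed parameterization, not quoted from the manual)
\[
P_{\lambda,\alpha}(\beta)
  = \lambda \sum_{j=1}^{p}
    \left( \alpha\,|\beta_j| + \frac{1-\alpha}{2}\,\beta_j^{2} \right)
\]
% Setting alpha = 1 leaves only the lasso term; alpha = 0 leaves only the ridge term
\[
\alpha = 1:\; \lambda \sum_{j=1}^{p} |\beta_j|
\qquad\qquad
\alpha = 0:\; \frac{\lambda}{2} \sum_{j=1}^{p} \beta_j^{2}
\]
```

With α = 0 the absolute-value term vanishes, so no coefficient is set exactly to zero; the fit shrinks coefficients but does not select variables.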

Square-root lasso is a variant of lasso for linear models.

. sqrtlasso y x1-x1000

You can force the selection of variables such as x1-x4.

. lasso linear y (x1-x4) x5-x1000

After fitting a lasso, you can use the postlasso commands.

. lassoknots                 table of estimated models by lambda
. lassocoef                  selected variables
. lassogof                   goodness of fit
. lassoselect lambda = 0.1   select model for another lambda
. coefpath                   plot coefficient path
. cvplot                     plot cross-validation function

And then there are features that will make it easier to do all the above. Need to split your data into training and testing samples? Type

. splitsample, generate(sample) nsplit(2)
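
As a minimal sketch of how the split feeds into the commands above, the following uses the auto dataset shipped with Stata in place of the y and x1-x1000 placeholders. The dataset, the covariate list, and the seed value are assumptions chosen only so that the lines run as written.

```stata
* Illustrative sketch only: auto.dta stands in for your own data
sysuse auto, clear

* Randomly assign each observation to sample 1 (training) or 2 (testing)
splitsample, generate(sample) nsplit(2) rseed(1234)

* Fit the lasso on the training half; rseed() fixes the CV fold assignment
lasso linear price mpg headroom trunk weight length turn displacement gear_ratio if sample == 1, rseed(1234)

* List the selected variables
lassocoef

* Report MSE and R-squared separately for each half
lassogof, over(sample)
```

Fitting with if sample == 1 and then specifying over(sample) keeps the testing half out of estimation while still reporting fit for both halves side by side.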

Need to manage large variable lists? You do. We typed x1-x1000 above, but your variables will have real names, and you do not want to type them all. Use the vl commands to create lists of variables:

. vl set                    // creates vlcontinuous, vlcategorical, ...
. vl create myconts       = vlcontinuous
. vl modify myconts       = myconts - (kl srh srd polyt)
. vl create myfactors     = vlcategorical
. vl substitute myvarlist = i.myfactors myconts i.myfactors#c.myconts

The # sign creates interactions.

We just created myvarlist, which is ready for use in a lasso command such as

. lasso linear y $myvarlist
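
The same steps can be tried on a small scale. This sketch again assumes the auto dataset shipped with Stata; the list names and the variables placed in them are illustrative, not part of the example above.

```stata
* Illustrative sketch only: auto.dta and the list contents are placeholders
sysuse auto, clear

* Classify the variables into the system-defined lists
vl set

* Build two user-defined variable lists from explicit variable names
vl create myconts   = (mpg weight length)
vl create myfactors = (foreign)

* Combine them, with factor-by-continuous interactions, into one list
vl substitute myvarlist = i.myfactors myconts i.myfactors#c.myconts

* Each list is available as a global macro of the same name
display "$myvarlist"
lasso linear price $myvarlist, rseed(1234)
```

Displaying the macro before fitting is a quick way to check which terms the lasso will be offered.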

Let's see it work

We are going to show you three examples.

Lasso with λ selected by cross-validation.
The same lasso, but we select λ to minimize the BIC.
The same lasso, fit by adaptive lasso.

And then we are going to compare them.

Rather than typing

. lasso linear q104 (i.gender i.q3 i.q4 i.q5) i.q2 i.q6 i.q7 i.q8 i.q9
     i.q10 i.q11 i.q13 i.q14 i.q16 i.q17 i.q19 i.q25 i.q26 i.q29
     i.q30 i.q32 i.q33 i.q34 i.q36 i.q37 i.q38 i.q40 i.q41 i.q42
     i.q43 i.q44 i.q46 i.q47 i.q48 i.q49 i.q50 i.q51 i.q55 i.q56
     i.q57 i.q58 i.q59 i.q61 i.q64 i.q65 i.q67 i.q68 i.q69 i.q71
     i.q72 i.q73 i.q74 i.q75 i.q77 i.q78 i.q79 i.q82 i.q83 i.q84
     i.q85 i.q86 i.q88 i.q89 i.q90 i.q91 i.q94 i.q95 i.q96 i.q97
     i.q98 i.q100 i.q101 i.q102 i.q105 i.q108 i.q109 i.q110 i.q113
     i.q114 i.q115 i.q116 i.q117 i.q118 i.q122 i.q123 i.q125 i.q126
     i.q128 i.q130 i.q133 i.q134 i.q136 i.q137 i.q138 i.q140 i.q142
     i.q143 i.q144 i.q145 i.q146 i.q147 i.q148 i.q149 i.q150 i.q151
     i.q152 i.q153 i.q154 i.q155 i.q156 i.q158 i.q159 i.q160 i.q161
     age q1 q15 q18 q20 q21 q22 q24 q31 q35 q45 q52 q53 q62 q63 q70
     q76 q87 q93 q103 q111 q112 q120 q121 q129 q131 q132 q139 q157

we have used vl behind the scenes, so that we can type

. lasso linear q104 ($idemographics) $ifactors $vlcontinuous

And so that we can compare the out-of-sample predictions for the three models, we have already split our sample in two by typing

. splitsample, generate(sample) nsplit(2) rseed(1234)

We will fit all three models on sample==1 and later compare predictions using sample==2.

Example 1: Lasso with λ selected by cross-validation

To fit a lasso with the default cross-validation selection method, we type

. lasso linear q104 ($idemographics) $ifactors $vlcontinuous if sample == 1

10-fold cross-validation with 100 lambdas ...
Grid value 1:     lambda = .9109571   no. of nonzero coef. =       4
Folds: 1...5....10   CVF = 16.93341
[output omitted]
Grid value 23:    lambda = .1176546   no. of nonzero coef. =      74
Folds: 1...5....10   CVF = 12.17933
... cross-validation complete ... minimum found

Lasso linear model                          No. of obs        =        458
                                            No. of covariates =        273
Selection: Cross-validation                 No. of CV folds   =         10



                                          No. of      Out-of-      CV mean
                                         nonzero       sample   prediction
      ID       Description      lambda     coef.    R-squared        error

       1      first lambda    .9109571         4       0.0147     16.93341
      18     lambda before    .1873395        42       0.2953     12.10991
    * 19   selected lambda    .1706967        49       0.2968     12.08516
      20      lambda after    .1555325        55       0.2964     12.09189
      23       last lambda    .1176546        74       0.2913     12.17933


* lambda selected by cross-validation.

Lambda (λ) is lasso's penalty parameter. Lasso fits a range of models, from models with no covariates to models with lots, corresponding to models with large λ to models with small λ.
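
The role of λ is easiest to see in the objective that lasso minimizes for a linear model. The form below is the standard textbook statement, written with N observations and p candidate covariates; the 1/(2N) scaling and the absence of per-coefficient weights are assumptions here, and the [LASSO] manual gives the exact definition Stata uses.

```latex
% Textbook lasso objective for a linear model (assumed scaling)
\[
\hat\beta(\lambda)
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i\beta \right)^{2}
    + \lambda \sum_{j=1}^{p} |\beta_j|
\]
```

As λ grows, the penalty term dominates and more coefficients are set exactly to zero. As λ shrinks toward zero, the solution approaches the unpenalized least-squares fit.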

Lasso then selected a model. Because we did not specify otherwise, it used its default, cross-validation (CV) to choose model ID=19, which has λ=0.171. The model has 49 covariates.

Cross-validation chooses the model that minimizes the cross-validation function. Here is a graph of it.

. cvplot
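
For reference, the K-fold cross-validation function for a linear model is usually defined as the mean squared prediction error over held-out observations. The notation below is an assumption for illustration: F_k is the set of observations in fold k, and the superscript (-k) marks estimates obtained with fold k left out. Here K = 10, as the output reports.

```latex
% Textbook K-fold CV function for a linear model (assumed notation)
\[
\mathrm{CV}(\lambda)
  = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in F_k}
    \left( y_i - \hat\beta_0^{(-k)}(\lambda) - x_i \hat\beta^{(-k)}(\lambda) \right)^{2},
\qquad
\lambda^{*} = \arg\min_{\lambda}\, \mathrm{CV}(\lambda)
\]
```

In the table above, the CV mean prediction error is 12.08516 at ID 19, lower than at the neighboring grid values (12.10991 and 12.09189), which is why that λ is marked as selected.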

We plan on comparing this model with two other models, so we will store these estimates. We will store them under the name cv.

. estimates store cv

Example 2: The same lasso, but we select λ to minimize the BIC

We can select the model corresponding to any λ we wish after fitting the lasso. Picking the λ that has the minimum Bayesian information criterion (BIC) gives good predictions under certain conditions.

To fit a lasso with minimum BIC, we use the same command and specify the additional option selection(bic):

. lasso linear q104 ($idemographics) $ifactors $vlcontinuous
     if sample == 1, selection(bic)


Evaluating up to 100 lambdas in grid ...
Grid value 1:     lambda = .9109571   no. of nonzero coef. =       4
                  BIC = 2618.642                                    
Grid value 2:     lambda = .8300302   no. of nonzero coef. =       7
                  BIC = 2630.961                                    
Grid value 3:     lambda = .7562926   no. of nonzero coef. =       8
                  BIC = 2626.254                                    
Grid value 4:     lambda = .6891057   no. of nonzero coef. =       9
                  BIC = 2619.727                                    
Grid value 5:     lambda = .6278874   no. of nonzero coef. =      10
                  BIC = 2611.577                                    
Grid value 6:     lambda = .5721076   no. of nonzero coef. =      13
                  BIC = 2614.155                                    
Grid value 7:     lambda = .5212832   no. of nonzero coef. =      13
                  BIC = 2597.164                                    
Grid value 8:     lambda = .4749738   no. of nonzero coef. =      14
                  BIC = 2588.189                                    
Grid value 9:     lambda = .4327784   no. of nonzero coef. =      16
                  BIC = 2584.638                                    
Grid value 10:    lambda = .3943316   no. of nonzero coef. =      18
                  BIC = 2580.891                                    
Grid value 11:    lambda = .3593003   no. of nonzero coef. =      22
                  BIC = 2588.984                                    
Grid value 12:    lambda =  .327381   no. of nonzero coef. =      26
                  BIC = 2596.792                                    
Grid value 13:    lambda = .2982974   no. of nonzero coef. =      27
                  BIC = 2586.521                                    
Grid value 14:    lambda = .2717975   no. of nonzero coef. =      28
                  BIC = 2578.211                                    
Grid value 15:    lambda = .2476517   no. of nonzero coef. =      32
                  BIC = 2589.632                                    
Grid value 16:    lambda =  .225651   no. of nonzero coef. =      35
                  BIC = 2593.753                                    
Grid value 17:    lambda = .2056048   no. of nonzero coef. =      37
                  BIC = 2592.923                                    
Grid value 18:    lambda = .1873395   no. of nonzero coef. =      42
                  BIC = 2609.975                                    
Grid value 19:    lambda = .1706967   no. of nonzero coef. =      49
                  BIC = 2639.437                                    
... selection BIC complete ... minimum found                        


Lasso linear model                          No. of obs        =        458
                                            No. of covariates =        273
Selection: Bayesian information criterion



                                          No. of                          
                                         nonzero    In-sample             
      ID       Description      lambda     coef.    R-squared          BIC


       1      first lambda    .9109571         4       0.0308     2618.642
      13     lambda before    .2982974        27       0.3357     2586.521
    * 14   selected lambda    .2717975        28       0.3563     2578.211
      15      lambda after    .2476517        32       0.3745     2589.632
      19       last lambda    .1706967        49       0.4445     2639.437


* lambda selected by Bayesian information criterion
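
The BIC column in the table above can be read as a trade-off between fit and model size. A common textbook form is shown below; treating the degrees of freedom df(λ) as the number of nonzero coefficients is an assumption here, not a statement taken from the output.

```latex
% Textbook BIC (assumes df equals the number of nonzero coefficients)
\[
\mathrm{BIC}(\lambda)
  = -2 \ln L\bigl(\hat\beta(\lambda)\bigr) + \mathrm{df}(\lambda)\,\ln N
\]
% Worked step from ID 13 to ID 14, with N = 458 as reported in the output
\[
\Delta\mathrm{BIC} = 2578.211 - 2586.521 = -8.310,
\qquad
\ln 458 \approx 6.127
\]
\[
\Delta\bigl(-2 \ln L\bigr) \approx -8.310 - 6.127 = -14.437
\]
```

Under that assumption, adding the 28th nonzero coefficient costs about 6.127 in penalty but improves the fit term by about 14.437, so the BIC falls. At ID 15 the four additional coefficients no longer pay for themselves, and the BIC rises again.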

We can draw the BIC function plot:

. bicplot

We will store these results as minBIC.

. estimates store minBIC

Example 3: The same lasso, fit by adaptive lasso

Adaptive lasso is another selection technique that tends to select fewer covariates. It also uses cross-validation but runs multiple lassos. By default, it runs two.
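
The reason a second lasso tends to drop covariates shows up in the penalty weights. In the usual formulation, sketched below, each coefficient in step 2 is penalized in inverse proportion to the size of its step 1 estimate. The weight formula and the exponent δ = 1 are assumptions based on the standard adaptive lasso; the [LASSO] manual documents what Stata does by default.

```latex
% Standard adaptive-lasso second step (assumed weights, delta = 1)
\[
\hat\beta^{(2)}(\lambda)
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i\beta \right)^{2}
    + \lambda \sum_{j=1}^{p} w_j\,|\beta_j|,
\qquad
w_j = \frac{1}{\bigl|\hat\beta_j^{(1)}\bigr|^{\delta}}
\]
```

A covariate with a small step 1 coefficient gets a large weight and is penalized heavily, while a covariate with a zero step 1 coefficient is effectively excluded from step 2.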

To fit an adaptive lasso, we use the same command and specify the additional option selection(adaptive):

. lasso linear q104 ($idemographics) $ifactors $vlcontinuous
     if sample == 1, selection(adaptive)

Lasso step 1 of 2:

10-fold cross-validation with 100 lambdas ...
Grid value 1:     lambda = .9109571   no. of nonzero coef. =       4
Folds: 1...5....10   CVF = 17.012
Grid value 2:     lambda = .8300302   no. of nonzero coef. =       7
[output omitted]
Grid value 24:    lambda = .1072025   no. of nonzero coef. =      78
Folds: 1...5....10   CVF = 12.40012
... cross-validation complete ... minimum found

Lasso step 2 of 2:

Evaluating up to 100 lambdas in grid ...
Grid value 1:     lambda = 51.68486   no. of nonzero coef. =       4
[output omitted]
Grid value 100:   lambda = .0051685   no. of nonzero coef. =      59

10-fold cross-validation with 100 lambdas ...
Fold  1 of 10:  10....20....30....40....50....60....70....80....90....100
[output omitted]
Fold 10 of 10:  10....20....30....40....50....60....70....80....90....100
... cross-validation complete

Lasso linear model                         No. of obs         =        458
                                           No. of covariates  =        273
Selection: Adaptive                        No. of lasso steps =          2

Final adaptive step results


                                          No. of      Out-of-      CV mean
                                         nonzero       sample   prediction
      ID       Description      lambda     coef.    R-squared        error

      25      first lambda    51.68486         4       0.0101     17.01083
      77     lambda before    .4095937        46       0.3985     10.33691
    * 78   selected lambda    .3732065        46       0.3987     10.33306
      79      lambda after    .3400519        47       0.3985     10.33653
     124       last lambda    .0051685        59       0.3677     10.86697


* lambda selected by cross-validation in final adaptive step.

Adaptive lasso selected a model with 46 covariates instead of the 49 selected by ordinary lasso.

We will store these results as adaptive.

. estimates store adaptive

Comparison of results

We have three sets of results.

cv contains the model selected by CV.

minBIC contains the model that corresponds to the minimum BIC.

adaptive contains the model selected by adaptive lasso.

First, let's compare the variables each selected. The lassocoef command does this. We specify sort(coef, standardized) so that the variables with the largest absolute values of their standardized coefficients are listed first.

. lassocoef cv minBIC adaptive, sort(coef, standardized)

                  cv       minBIC    adaptive 

       0.q19       x         x          x     
       0.q85       x         x          x     
        1.q5       x         x          x     
      3.q156       x         x          x     
      0.q101       x         x          x     
       0.q88       x         x          x     
       0.q48       x         x          x     
         q22       x         x          x     
                                              
         q38                                  
          4        x         x          x     
                                              
        q139       x         x          x     
       0.q56       x         x          x     
         q31       x         x          x     
       0.q73       x         x          x     
       0.q96       x         x          x     
    1.gender       x         x          x     
       0.q50       x         x          x     
        1.q3       x         x          x     
       3.q16       x         x          x     
       2.q84       x         x          x     
       0.q43       x         x          x     
      0.q149       x         x          x     
      0.q159       x         x          x     
      3.q134       x         x          x     
       0.q49       x                    x     
      0.q115       x         x          x     
      0.q108       x         x          x     
      0.q109       x                    x     
      0.q140       x                    x     
       0.q91       x                    x     
                                              
         q38                                  
          3        x         x          x     
                                              
         q93       x                    x     
       0.q14       x                    x     
      0.q153       x                    x     
      0.q160       x         x          x     
         age       x                    x     
         q53       x                    x     
      2.q105       x                          
      0.q102       x                    x     
      0.q154       x                    x     
        q111       x                    x     
      0.q142       x                    x     
       0.q55       x                          
       0.q97       x                          
                                              
         q65                                  
          4        x                    x     
                                              
      1.q110       x                    x     
         q70       x                          
       _cons       x         x          x     
       0.q44                            x     


Legend:
  b - base level
  e - empty cell
  o - omitted
  x - estimated

Start at the top and look down, and you will see that all three approaches selected the first 23 variables listed in the table, the variables with the largest coefficients.

Which model produces the best predictions? Let's do out-of-sample prediction to find out. We split our data into two samples at the outset for just this purpose. We fit the models on sample 1. We can compare predictions for sample 2.

The lassogof command reports goodness-of-fit statistics. We specify option postselection to compare predictions based on the postselection coefficients instead of the penalized coefficients. We specify option over(sample) so that lassogof calculates fit statistics for each sample separately.

. lassogof cv minBIC adaptive, over(sample) postselection

Postselection coefficients


Name             sample           MSE    R-squared        Obs

cv                                                           
                      1      8.652771       0.5065        503
                      2      14.58354       0.2658        493

minBIC                                                       
                      1      9.740229       0.4421        508
                      2      13.44496       0.3168        503

adaptive                                                     
                      1      8.637575       0.5057        504
                      2      14.70756       0.2595        494

We compare MSE and R-squared for sample 2. minBIC did best by both measures.
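
To put the sample 2 differences in perspective, the relative reductions in MSE can be worked out directly from the table above. Nothing below is new data; it is arithmetic on the reported values.

```latex
% Relative reduction in sample 2 MSE of minBIC versus cv and versus adaptive
\[
\frac{14.58354 - 13.44496}{14.58354} \approx 0.078,
\qquad
\frac{14.70756 - 13.44496}{14.70756} \approx 0.086
\]
% Drop in R-squared from sample 1 to sample 2 for each model
\[
\text{cv: } 0.5065 - 0.2658 = 0.2407,
\quad
\text{minBIC: } 0.4421 - 0.3168 = 0.1253,
\quad
\text{adaptive: } 0.5057 - 0.2595 = 0.2462
\]
```

minBIC has an out-of-sample MSE roughly 8% lower than either alternative, and its R-squared falls least between the two samples, even though it fits sample 1 less closely than the other two models.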

Tell me more

Learn more about Stata's lasso features.

Read more about lasso for prediction in the Lasso Reference Manual; see [LASSO] lasso intro.

See [D] splitsample for more about the splitsample command.

See [D] vl for more about the vl command for constructing long variable lists.

Also see Bayesian lasso.


