In the spotlight: Lasso
You have lots of data. Lots of variables. Maybe even more variables than observations. Perhaps you have genetic data and want to predict a certain type of cancer. Perhaps you have demographic data and want to predict employment status. Or perhaps you have data recording words used in restaurant reviews and want to predict health inspection scores. When you know some variables will be helpful in predicting the outcome but you don't know which ones, lasso can help.
With Stata 16's new lasso features, you can sift through many potential variables and extract those that can predict the outcome. With a command such as
. lasso linear y x1-x1000
you can select from among 1000 potential variables. Or if your outcome is binary, you could type
. lasso logit y x1-x1000
If you want to select variables in a training sample and evaluate performance in a validation sample, you add just a few more commands.
. splitsample, generate(sample)
. lasso linear y x1-x1000 if sample==1
. lassogof if sample==2
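The same split-then-evaluate idea is not specific to Stata. As a hedged illustration only, here is a sketch in Python with scikit-learn on simulated data; the variable counts, coefficients, and function names are invented for the example and are not part of the Stata workflow above:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 400, 100
X = rng.standard_normal((n, p))
# simulated outcome: only the first 3 of the 100 candidates truly matter
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.standard_normal(n)

# split into training and validation samples (the role splitsample plays)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# choose the penalty by cross-validation and select variables on the
# training half (the role of lasso linear ... if sample==1)
model = LassoCV(cv=10).fit(X_tr, y_tr)
selected = np.flatnonzero(model.coef_ != 0)

# assess out-of-sample fit on the held-out half (the role of lassogof)
print("selected:", selected[:10], " validation R^2:", round(model.score(X_va, y_va), 2))
```

With a clear signal like this, the cross-validated lasso recovers the three true predictors (typically along with a few noise variables) and the held-out R-squared stays close to the in-sample fit, which is the point of validating on a sample the selection step never saw.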
In our blog post An introduction to the lasso in Stata, we demonstrate how to use various techniques such as cross-validation and adaptive lasso to select variables and how to evaluate their predictive abilities.
Sometimes, you will want to go beyond variable selection and prediction. You might want standard errors, tests, and confidence intervals for the coefficients on a few variables of interest. When inference is the goal, Stata 16 also provides a suite of lasso commands that yield valid inference for those variables while using lasso methods to select controls from among many others. For instance, you can use a method called double selection and perform inference for x1 and x2 by typing
. dsregress y x1 x2, controls(x3-x1000)
In our blog post Using lasso for inference in high-dimensional models, we give an overview of inference with lasso and walk you through examples of three estimators available in Stata 16.
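To make the double-selection logic concrete, here is a minimal sketch in Python on simulated data. This is an illustration of the basic idea only, not Stata's implementation: all data, coefficients, and names are invented, and dsregress additionally handles details (penalty choice, robust standard errors) that this sketch omits:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 500, 100
controls = rng.standard_normal((n, p))
# x1 depends on a few controls; y depends on x1 and on overlapping controls
x1 = controls[:, 0] + 0.5 * controls[:, 1] + rng.standard_normal(n)
y = 1.0 * x1 + 2.0 * controls[:, 0] + controls[:, 2] + rng.standard_normal(n)

# step 1: lasso of the outcome on the controls
sel_y = np.flatnonzero(LassoCV(cv=5).fit(controls, y).coef_ != 0)
# step 2: lasso of the variable of interest on the controls
sel_x = np.flatnonzero(LassoCV(cv=5).fit(controls, x1).coef_ != 0)
# keep the union of the two selected sets
union = sorted(set(sel_y) | set(sel_x))

# step 3: ordinary least squares of y on x1 plus the union of controls;
# beta[1] is the coefficient on x1
Z = np.column_stack([np.ones(n), x1, controls[:, union]])
beta = np.linalg.lstsq(Z, y, rcond=None)[0]
print("estimated effect of x1:", round(beta[1], 2))
```

Selecting controls from both equations is what protects the final regression: a control that matters for x1 but only weakly for y (or vice versa) still makes it into the union, so omitting it cannot bias the coefficient on x1.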
— David Drukker
Executive Director of Econometrics
— Di Liu