
Highlights

  • Methods

    • Double selection

    • Partialing out

    • Cross-fit partialing out

  • Models

    • Linear regression

    • Instrumental variables

    • Logistic (logit) regression

    • Poisson regression

  • Postestimation

    • Inference statistics for specified variables of interest

    • Joint hypotheses

    • Save estimation results to disk, including underlying lassos

    • Examine underlying lassos

We are faced with more and more data and with harder and harder questions.

Need to sort relevant from irrelevant variables? Try lasso.
Unsure how control variables affect your outcome? Try lasso.
Concerned about nonlinearities and interactions? Try lasso.

The lasso and some other machine learning techniques are reshaping the dialog about how we perform inference. They let us focus on our questions of interest and be less concerned about the unimportant parts of our model. The remainder of our model can be adequately captured by sifting through hundreds or even thousands of potential covariates or a highly nonlinear expansion of potential covariates.

Focus on what interests you and let lasso discover the features that adequately represent the rest of your model.

Stata's lasso for inference commands report coefficients, standard errors, etc. for the specified variables of interest and use lasso to select, from the potential control variables you specify, the other covariates (controls) that need to appear in the model.

The inference methods are robust to model-selection mistakes that lasso might make.

Lasso is intended for prediction, and it selects covariates that are jointly correlated with the variables that belong in the best-approximating model. Said differently, lasso estimates which variables belong in the model. Like all estimation, this is subject to error.

However you put it, the inference methods are robust to these errors if the true variables are among the potential control variables that you specify.
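
All of the inference commands share the same basic pattern. As a schematic, with placeholder names rather than real variables:

. dsregress depvar varsofinterest, controls(potential_controls)

The variables listed before the comma are the ones you want inference on; everything inside controls() is merely a candidate, and lasso decides which candidates actually enter the model.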

Let's see it work

We will show you three examples.

  1. Double selection, linear regression

  2. Double selection, logistic regression

  3. Cross-fit partialing out, instrumental variables

Example 1: Double selection, linear regression

We are about to use double selection, but the example below applies to all the methods. Rather than dsregress, you could instead use poregress or xporegress.

We have data on 4,642 birthweights and 22 variables about the baby's mother and father. We want to know whether the mother's smoking and education affect birthweight. The variables of interest are:

             i.msmoke     how much the mother smokes (categorical)
             medu         mother's education (years of schooling)

i. is how categorical variables are written in Stata.

We are going to specify the control variables as follows:

continuous:
              mage         mother's age
              fedu         father's education
              monthslb     months since mother last gave birth

categorical:
              i.foreign    if mother is foreign born (0/1)
              i.alcohol    if mother drinks during pregnancy (0/1)
              i.prenatal1  if first prenatal visit was in first trimester (0/1)
              i.mmarried   if mother is married to father (0/1)
              i.order      birth order of infant

We worry that interactions might also be important, so we are going to fit the model of bweight on i.msmoke and medu and

i.foreign
i.alcohol##i.prenatal1
i.mmarried#(c.mage##c.mage)
i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu)

That is a total of 104 covariates. Yet we do not worry about overfitting the model, because the control variables that we specify are potential control variables. Lasso will select the relevant ones.
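
If you want to verify what a factor-variable expression expands to, Stata's fvexpand command lists the individual terms. For instance, with the dataset in memory, you could type

. fvexpand i.mmarried#(c.mage##c.mage)

Counting the terms that fvexpand lists across the four control expressions is how we arrive at that total of 104.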

The dsregress command will select the controls and present the results for the variables of interest:

. dsregress bweight i.msmoke medu,
       controls(i.foreign i.alcohol##i.prenatal1
                i.mmarried#(c.mage##c.mage)
                i.order##(c.mage#c.fedu
                          c.mage##c.monthslb
                          c.fedu##c.fedu))

Estimating lasso for bweight using plugin
Estimating lasso for 1bn.msmoke using plugin
Estimating lasso for 2bn.msmoke using plugin
Estimating lasso for 3bn.msmoke using plugin
Estimating lasso for medu using plugin

Double-selection linear model         Number of obs               =      4,642
                                      Number of controls          =        104
                                      Number of selected controls =         15
                                      Wald chi2(4)                =      94.48
                                      Prob > chi2                 =     0.0000

------------------------------------------------------------------------------
             |               Robust
     bweight | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      msmoke |
   1-5 daily |  -157.5933   36.54639    -4.31   0.000     -229.223   -85.96374
  6-10 daily |  -215.8084   34.53717    -6.25   0.000       -283.5   -148.1168
   11+ daily |  -260.0144   34.41246    -7.56   0.000    -327.4616   -192.5672
        medu |   3.306897   4.321033     0.77   0.444    -5.162172    11.77597
------------------------------------------------------------------------------
Note: Chi-squared test is a Wald test of the coefficients of the variables of
interest jointly equal to zero. Lassos select controls for model estimation.
Type lassoinfo to see number of selected variables in each lasso.

We find

  1. the more the mother smokes, the less the baby weighs.

  2. the mother's education has a trivial effect on birthweight (about 3 grams per year of schooling) and is not statistically significant.

Note that the output reports that we specified 104 control variables, and lasso selected 15 of them.
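
The Highlights above mention examining the underlying lassos. After dsregress, or any of the inference commands, you can type lassoinfo to summarize each lasso that was fit and lassocoef to list the controls each lasso selected. For example,

. lassoinfo
. lassocoef (., for(bweight)) (., for(medu))

where (., for(varname)) refers to the lasso for varname in the current estimation results.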

Example 2: Double selection, logistic regression

In the literature, the concern is often about low-birthweight babies, defined as those weighing less than 2,500 grams.

Let's fit the equivalent low-birthweight model. We will specify the same potential control variables, but we will fit the model using dslogit instead of dsregress. If we instead wanted partialing out or cross-fit partialing out, we could use pologit or xpologit.

Here is the result.

. dslogit lbweight i.msmoke medu,
       controls(i.foreign i.alcohol##i.prenatal1
                i.mmarried#(c.mage##c.mage)
                i.order##(c.mage#c.fedu
                          c.mage##c.monthslb
                          c.fedu##c.fedu))

Estimating lasso for lbweight using plugin
Estimating lasso for 1bn.msmoke using plugin
Estimating lasso for 2bn.msmoke using plugin
Estimating lasso for 3bn.msmoke using plugin
Estimating lasso for medu using plugin

Double-selection logit model          Number of obs               =      4,636
                                      Number of controls          =        104
                                      Number of selected controls =         18
                                      Wald chi2(4)                =      33.06
                                      Prob > chi2                 =     0.0000

------------------------------------------------------------------------------
             |               Robust
    lbweight | Odds ratio   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      msmoke |
   1-5 daily |   .9083797   .3036388    -0.29   0.774     .4717819    1.749015
  6-10 daily |   2.518055   .4837748     4.81   0.000     1.727947    3.669443
   11+ daily |   2.042259   .4154557     3.51   0.000     1.370728    3.042778
        medu |   .9538414   .0300264    -1.50   0.133     .8967696    1.014545
------------------------------------------------------------------------------
Note: Chi-squared test is a Wald test of the coefficients of the variables of
interest jointly equal to zero. Lassos select controls for model estimation.
Type lassoinfo to see number of selected variables in each lasso.

The reported coefficients are odds ratios. We find

  1. smoking five or fewer cigarettes per day decreases the odds that the baby is born with a low birthweight (the odds ratio is less than 1). The result is not significant, however, and for more than five cigarettes, the more the mother smokes, the greater the odds that the baby will weigh less than 2,500 grams.

  2. the mother's education is still not significant.
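
The Highlights also mention joint hypotheses. Because these commands leave behind standard Stata estimation results, the usual postestimation test command works. For instance, to test whether the 6-10 and 11+ smoking categories have equal coefficients, we could type

. test 2.msmoke = 3.msmoke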

Example 3: Cross-fit partialing out, instrumental variables

We found no statistically significant effect of the mother's education when we fit models for birthweight and low birthweight. The mother's education, however, is presumably endogenous. We will specify the same model and add more to it. We are going to specify that medu is endogenous and specify the potential covariates for washing out that endogeneity.

To fit the linear model, we previously typed

. dsregress bweight i.msmoke medu,
       controls(i.foreign i.alcohol##i.prenatal1
                i.mmarried#(c.mage##c.mage)
                i.order##(c.mage#c.fedu
                          c.mage##c.monthslb
                          c.fedu##c.fedu))

Where we specified medu, we will substitute

(medu = potential instruments)

In particular, we will substitute

                 (medu = c.fedu##
                         (c.prenatal#c.prenatal##c.prenatal)##
                         (i.foreign i.mmarried))

There is an additional change we have to make. We fit the original model using double-selection dsregress. Double selection cannot handle instrumental variables, but partialing out and cross-fit partialing out can. We need to change dsregress to the IV commands poivregress or xpoivregress. We will fit the model using cross-fit partialing out:

. xpoivregress bweight i.msmoke
       (medu = c.fedu##
               (c.prenatal#c.prenatal##c.prenatal)##
               (i.foreign i.mmarried)),
       controls(i.foreign i.alcohol##i.prenatal1
                i.mmarried#(c.mage##c.mage)
                i.order##(c.mage#c.fedu
                          c.mage##c.monthslb
                          c.fedu##c.fedu))

Cross-fit fold 1 of 10 ...
Estimating lasso for bweight using plugin
  (output omitted)

Cross-fit partialing-out           Number of obs                  =      4,642
IV linear model                    Number of controls             =        104
                                   Number of instruments          =         42
                                   Number of selected controls    =         27
                                   Number of selected instruments =          5
                                   Number of folds in cross-fit   =         10
                                   Number of resamples            =          1
                                   Wald chi2(4)                   =      97.20
                                   Prob > chi2                    =     0.0000

------------------------------------------------------------------------------
             |               Robust
     bweight | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        medu |  -39.27263   40.76139    -0.96   0.335    -119.1635    40.61822
      msmoke |
   1-5 daily |  -172.9989    38.1835    -4.53   0.000    -247.8372   -98.16065
  6-10 daily |  -229.9561   36.82347    -6.24   0.000    -302.1288   -157.7834
   11+ daily |  -275.7334   37.11482    -7.43   0.000    -348.4771   -202.9897
------------------------------------------------------------------------------
Endogenous: medu
Exogenous:  1bn.msmoke 2bn.msmoke 3bn.msmoke
Note: Chi-squared test is a Wald test of the coefficients of the variables of
interest jointly equal to zero. Lassos select controls for model estimation.
Type lassoinfo to see number of selected variables in each lasso.

The mother's education is still not significant. Notice that lasso selected 5 instruments from the 42 we specified.
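
Cross-fit models can take a while to run, so you may want to save the estimation results to disk, underlying lassos included, and reload them in a later session. Using an illustrative filename,

. estimates save xpoiv_bweight
. estimates use xpoiv_bweight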

Learn about vl

Don't you wish that the inference command could be shorter? The last command we fit was

. xpoivregress bweight i.msmoke
       (medu = c.fedu##
               (c.prenatal#c.prenatal##c.prenatal)##
               (i.foreign i.mmarried)),
       controls(i.foreign i.alcohol##i.prenatal1
                i.mmarried#(c.mage##c.mage)
                i.order##(c.mage#c.fedu
                          c.mage##c.monthslb
                          c.fedu##c.fedu))

It can be. We could have fit the same model by typing

. xpoivregress bweight i.msmoke (medu = `instr'), controls(`controls')
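
Here `instr' and `controls' are local macros. Even without vl, you could define them yourself in a do-file, a sketch using the same expressions as above:

. local controls i.foreign i.alcohol##i.prenatal1         ///
       i.mmarried#(c.mage##c.mage)                        ///
       i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu)

. local instr c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)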

Stata's vl command makes it easy to construct such lists of variables; see [D] vl, where we demonstrate its use.

Tell me more

Read more about Stata's lasso for inference commands in the Stata Lasso Reference Manual; see [LASSO] Lasso inference intro and [LASSO] Inference examples.

See Lasso for Prediction for Stata's other lasso capabilities.

See Nonparametric series regression, which can handle situations in which you know the control variables but not the functional form in which they appear in the true model.

Also see Bayesian lasso.