- Methods
  - Double selection
  - Partialing out
  - Cross-fit partialing out

- Models
  - Linear regression
  - Instrumental variables
  - Logistic (logit) regression
  - Poisson regression

- Postestimation
  - Inference statistics for specified variables of interest
  - Joint hypotheses
  - Save estimation results to disk, including underlying lassos
  - Examine underlying lassos

We are increasingly faced with more and more data and with harder and harder questions.

Need to sort relevant from irrelevant variables? Try lasso.

Unsure how control variables affect your outcome? Try lasso.

Concerned about nonlinearities and interactions? Try lasso.

The lasso and some other machine learning techniques are reshaping the dialog about how we perform inference. They let us focus on our questions of interest and be less concerned about the unimportant parts of our model. The remainder of our model can be adequately captured by sifting through hundreds or even thousands of potential covariates or a highly nonlinear expansion of potential covariates.

Focus on what interests you and let lasso discover the features that adequately represent the rest of your model.

Stata's lasso for inference commands report coefficients, standard errors, etc. for specified variables of interest and use lasso to select the other covariates (controls) that need to appear in the model from the potential control variables you specify.

The inference methods are robust to model-selection mistakes that lasso might make.

Lasso is intended for prediction and selects covariates that are jointly correlated with the variables that belong in the best-approximating model. Said differently, lasso estimates the variables that belong in the model. Like all estimation, this is subject to error.

However you put it, the inference methods are robust to these errors if the true variables are among the potential control variables that you specify.

We will show you three examples.

- Double selection, linear regression
- Double selection, logistic regression
- Cross-fit partialing out, instrumental variables

We are about to use double selection, but the example below applies
to all the methods. Rather than using **dsregress**, you could
have used **poregress** or **xporegress**.

We have data on 4,642 birthweights and 22 variables about the baby's mother and father. We want to know whether the mother's smoking and education affect birthweight. The variables of interest are

| Variable | Description |
|---|---|
| i.msmoke | how much the mother smokes (categorical) |
| medu | mother's education (years of schooling) |

**i.** is how categorical variables are written in Stata.

We are going to specify the control variables as follows:

| Type | Variable | Description |
|---|---|---|
| continuous | mage | mother's age |
| continuous | fedu | father's education |
| continuous | monthslb | months since mother last gave birth |
| categorical | i.foreign | if mother is foreign born (0/1) |
| categorical | i.alcohol | if mother drinks during pregnancy (0/1) |
| categorical | i.prenatal1 | prenatal visit in one trimester (0/1) |
| categorical | i.mmarried | if mother is married to father (0/1) |
| categorical | i.order | birth order of infant (0th, 1st, 2nd) |

We worry that interactions might also be important, so we are going to
fit the model of **bweight** on **i.msmoke** and **medu** and

- i.foreign
- i.alcohol##i.prenatal1
- i.mmarried#(c.mage##c.mage)
- i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu)

That is a total of 104 covariates. Yet we do not worry about overfitting the model, because the control variables that we specify are potential control variables. Lasso will select the relevant ones.

The command **dsregress** will select the covariates and present the results
for the covariates of interest:

    . dsregress bweight i.msmoke medu, controls(i.foreign i.alcohol##i.prenatal1 i.mmarried#(c.mage##c.mage) i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

    Estimating lasso for bweight using plugin
    Estimating lasso for 1bn.msmoke using plugin
    Estimating lasso for 2bn.msmoke using plugin
    Estimating lasso for 3bn.msmoke using plugin
    Estimating lasso for medu using plugin

    Double-selection linear model      Number of obs               =      4,642
                                       Number of controls          =        104
                                       Number of selected controls =         15
                                       Wald chi2(4)                =      94.48
                                       Prob > chi2                 =     0.0000

| bweight | Coef. | Robust Std. Err. | z | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| msmoke | | | | | | |
| 1-5 daily | -157.5933 | 36.54639 | -4.31 | 0.000 | -229.223 | -85.96374 |
| 6-10 daily | -215.8084 | 34.53717 | -6.25 | 0.000 | -283.5 | -148.1168 |
| 11+ daily | -260.0144 | 34.41246 | -7.56 | 0.000 | -327.4616 | -192.5672 |
| medu | 3.306897 | 4.321033 | 0.77 | 0.444 | -5.162172 | 11.77597 |

We find

- the more the mother smokes, the less the baby weighs.
- the mother's education affects the birthweight trivially (3 grams/year of education) and is not significant.

Note that the output reports that we specified 104 control variables, and lasso selected 15 of them.
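The postestimation features listed at the top can be tried right after this fit. The lines below are a minimal sketch, assuming the **dsregress** results above are still the active estimation results. The filename `ds_bweight` is a hypothetical choice, and no output is shown because these lines were not run here.

```stata
* Summarize the lassos that dsregress ran behind the scenes
lassoinfo

* List the potential controls selected by the lasso for bweight
lassocoef (., for(bweight))

* Joint Wald test that the coefficients on all smoking levels are zero
testparm i.msmoke

* Save the estimation results, including the underlying lassos, to disk
* (ds_bweight is a hypothetical filename)
estimates save ds_bweight, replace
```

See the *Stata Lasso Reference Manual* for the full syntax of each command.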

In the literature, the concern is often about low-birthweight babies, which weigh less than 2,500 grams.

Let's fit the equivalent low-birthweight model. We will specify the same
potential control variables, but we will fit the model using **dslogit**
instead of **dsregress**. We will use **dslogit**, but if we wanted
to use partialing out or cross-fit partialing out, we could also use
**pologit** or **xpologit**.

Here is the result.

    . dslogit lbweight i.msmoke medu, controls(i.foreign i.alcohol##i.prenatal1 i.mmarried#(c.mage##c.mage) i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

    Estimating lasso for lbweight using plugin
    Estimating lasso for 1bn.msmoke using plugin
    Estimating lasso for 2bn.msmoke using plugin
    Estimating lasso for 3bn.msmoke using plugin
    Estimating lasso for medu using plugin

    Double-selection logit model       Number of obs               =      4,636
                                       Number of controls          =        104
                                       Number of selected controls =         18
                                       Wald chi2(4)                =      33.06
                                       Prob > chi2                 =     0.0000

| lbweight | Coef. | Robust Std. Err. | z | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| msmoke | | | | | | |
| 1-5 daily | .9083797 | .3036388 | -0.29 | 0.774 | .4717819 | 1.749015 |
| 6-10 daily | 2.518055 | .4837748 | 4.81 | 0.000 | 1.727947 | 3.669443 |
| 11+ daily | 2.042259 | .4154557 | 3.51 | 0.000 | 1.370728 | 3.042778 |
| medu | .9538414 | .0300264 | -1.50 | 0.133 | .8967696 | 1.014545 |

Reported are odds ratios. We find

- smoking five or fewer cigarettes per day *decreases* the odds that the baby is born with a low birthweight (the odds ratio is less than 1), but the result is not significant.
- smoking more than five cigarettes per day significantly increases the odds that the baby will weigh less than 2,500 grams; the odds ratios are about 2.5 for 6-10 cigarettes and 2.0 for 11 or more.
- the mother's education is still not significant.
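Both examples so far used double selection. As a sketch of the alternatives named earlier, the same variables of interest and potential controls can be passed to the partialing-out and cross-fit partialing-out commands. The lines below assume the same data are in memory and are meant to be run from a do-file, because they use `///` line continuation. Only one command per model is shown; **xporegress** and **pologit** follow the same pattern.

```stata
* Linear model fit by partialing out instead of double selection
poregress bweight i.msmoke medu,                                      ///
    controls(i.foreign i.alcohol##i.prenatal1                         ///
             i.mmarried#(c.mage##c.mage)                              ///
             i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

* Logit model fit by cross-fit partialing out
xpologit lbweight i.msmoke medu,                                      ///
    controls(i.foreign i.alcohol##i.prenatal1                         ///
             i.mmarried#(c.mage##c.mage)                              ///
             i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))
```

The estimates will generally differ somewhat across methods, because each method uses the lasso selections differently.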

We found no statistically significant effect of the mother's education
when we fit models for birthweight and low birthweight. The mother's
education, however, is presumably endogenous. We will specify the
same model and add more to it. We are going to specify that
**medu** is endogenous and specify the potential instruments for
addressing that endogeneity.

To fit the linear model, we previously typed

    . dsregress bweight i.msmoke medu, controls(i.foreign i.alcohol##i.prenatal1 i.mmarried#(c.mage##c.mage) i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

Where we specified **medu**, we will substitute

    (medu = potential instruments)

In particular, we will substitute

    (medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried))

There is an additional change we have to make. We fit the original
model using double selection with **dsregress**. Double selection
cannot handle instrumental variables, but partialing out and
cross-fit partialing out can. We need to change **dsregress**
to **poivregress** or **xpoivregress**. We will fit the model
using cross-fit partialing out:

    . xpoivregress bweight i.msmoke (medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)), controls(i.foreign i.alcohol##i.prenatal1 i.mmarried#(c.mage##c.mage) i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

    Cross-fit fold 1 of 10 ...
    Estimating lasso for bweight using plugin
    (output omitted)

    Cross-fit partialing-out      Number of obs                  =      4,642
    IV linear model               Number of controls             =        104
                                  Number of instruments          =         42
                                  Number of selected controls    =         22
                                  Number of selected instruments =          4
                                  Number of folds in cross-fit   =         10
                                  Number of resamples            =          1
                                  Wald chi2(4)                   =      93.87
                                  Prob > chi2                    =     0.0000

| bweight | Coef. | Robust Std. Err. | z | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| medu | -5.994852 | 49.05562 | -0.12 | 0.903 | -102.1421 | 90.1524 |
| msmoke | | | | | | |
| 1-5 daily | -158.1356 | 38.39086 | -4.12 | 0.000 | -233.3804 | -82.89094 |
| 6-10 daily | -213.5149 | 38.3374 | -5.57 | 0.000 | -288.6548 | -138.3749 |
| 11+ daily | -259.3824 | 38.68729 | -6.70 | 0.000 | -335.2081 | -183.5567 |

The mother's education is still not significant. Notice that lasso selected 4 instruments from the 42 potential instruments reported in the output.
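The header above reports 10 cross-fit folds and 1 resample. Cross-fitting splits the sample at random, so results depend on the random-number seed. The sketch below shows how those settings could be made explicit; the values 5, 3, and 12345 are arbitrary illustrations, not recommendations, and the command is written for a do-file because it uses `///` line continuation.

```stata
* Same IV model, with explicit cross-fitting settings:
*   xfolds(5)    use 5 folds instead of 10
*   resample(3)  repeat the sample splitting 3 times and combine the results
*   rseed(12345) fix the seed so the random splits are reproducible
xpoivregress bweight i.msmoke                                                     ///
    (medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)), ///
    controls(i.foreign i.alcohol##i.prenatal1                                     ///
             i.mmarried#(c.mage##c.mage)                                          ///
             i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))          ///
    xfolds(5) resample(3) rseed(12345)
```

More folds and more resamples increase run time, since each fold and each resample requires its own set of lassos.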

Don't you wish that the inference commands could be shorter? The last command we ran was

    . xpoivregress bweight i.msmoke (medu = c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)), controls(i.foreign i.alcohol##i.prenatal1 i.mmarried#(c.mage##c.mage) i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu))

It can be shorter. We could have fit this model by typing

    . xpoivregress bweight i.msmoke (medu = `instr'), controls(`controls')

Stata's new **vl** command makes it easy to construct lists of
variables. See **[D] vl**. We
demonstrate the use of **vl** there.
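For the short form of the command to work, the macros `instr` and `controls` must be defined first. One way to do that without **vl** is with ordinary local macros, as in this minimal sketch. It assumes the lines are run together from a single do-file, since local macros exist only within the do-file or session that defines them.

```stata
* Potential instruments for medu
local instr c.fedu##(c.prenatal#c.prenatal##c.prenatal)##(i.foreign i.mmarried)

* Potential control variables
local controls i.foreign i.alcohol##i.prenatal1 ///
    i.mmarried#(c.mage##c.mage)                 ///
    i.order##(c.mage#c.fedu c.mage##c.monthslb c.fedu##c.fedu)

* The short form of the command shown above
xpoivregress bweight i.msmoke (medu = `instr'), controls(`controls')
```

Defining each list once also keeps the specification identical when switching between commands.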

Read more about Stata's lasso for inference commands in the
*Stata Lasso Reference Manual*;
see [LASSO] Lasso inference intro
and [LASSO] Inference examples.

See Lasso for Prediction for Stata's other lasso capabilities.

See Nonparametric series regression, which can handle situations in which you know the control variables but not the functional form in which they appear in the true model.

Also see Bayesian lasso.