
## In the spotlight: Estimating treatment effects with lasso

Estimating treatment effects in the potential outcome framework is a powerful tool for evaluating the effectiveness of a treatment based on observational data. However, in the presence of high-dimensional data, researchers often face a dilemma about how to build the model. On one hand, we want to gain deeper insights by making good use of large amounts of data. On the other hand, the more complex the model is, the more difficult it is to fit. We want to include more variables, but traditional estimation techniques cannot fit models with that many variables.

To resolve this conflict, we need to use model selection techniques such as lasso to select the variables that matter. At the same time, we want our estimator to be robust to model selection mistakes. In other words, we want our estimation results to still be valid even if lasso omits some important variables or includes some extra variables.

The new telasso command is designed to estimate treatment effects with many control variables and be robust to model selection mistakes. Through an example that compares two types of lung transplants, we will illustrate the dilemma or conflict of including many variables in the treatment-effects estimation and show how to use telasso to reconcile this conflict.

## Lung transplant data and control variables

First, we introduce the data and construct the control variables for both the outcome model and the treatment model.

Suppose we want to compare two types of lung transplants. Bilateral lung transplant (BLT) is usually associated with a higher death rate in the short term after the operation but with a more significant improvement in life quality than the single lung transplant (SLT). As a result, for patients who need to decide between these two treatment options, knowing the effect of BLT (versus SLT) on quality of life is essential. We can measure the quality of life based on an individual’s forced expiratory volume in one second (FEV1).

We have a fictional dataset (lung.dta) inspired by Koch, Vock, and Wolfson (2018). The outcome (fev1p) is FEV1% measured one year after the operation. FEV1% is the percentage of FEV1 that the patient has relative to a healthy person with similar characteristics. The treatment variable (transtype) indicates whether the treatment is BLT or SLT.

To start, we open the dataset and describe it.

. use https://www.stata-press.com/data/r17/lung, clear
(Fictional data on lung transplant)

. describe *, short

Variable      Storage   Display    Value
name         type    format    label      Variable label

agep            byte    %10.0g                Patient age (years)
bmip            double  %10.0g                Patient body mass index
diabetesp       byte    %12.0g     lbdiab     Patient diabetes status
heightp         double  %10.0g                Patient height (cm)
o2amt           double  %10.0g                Oxygen delivered
karn            byte    %8.0g      lbyes      Karnofsky score > 60
lungals         double  %10.0g                Lung allocation score
racep           byte    %8.0g      lbrace     Patient race
sexp            byte    %8.0g      lbsex      Patient gender
lifesvent       byte    %8.0g      lbyes      Life support ventilator needed
assisvent       byte    %8.0g      lbyes      Assisted ventilation needed
centervol       double  %10.0g                Center volume
walkdist        double  %10.0g                Walking distance in 6 minutes
o2rest          byte    %8.0g      lbyes      Oxygen needed at rest
aged            byte    %10.0g                Donor age (years)
raced           byte    %8.0g      lbrace     Donor race
bmid            double  %10.0g                Donor body mass index
smoked          byte    %8.0g      lbyes      Donor if has history of smoking
cmv             byte    %8.0g      lbyes      Positive cytomegalovirus test
deathcause      byte    %8.0g      lbyes      Cause of death - traumatic brain injury
diabetesd       byte    %12.0g     lbdiab     Donor diabetes status
expandd         byte    %8.0g      lbyes      Expanded donor needed
heightd         double  %10.0g                Donor height (cm)
sexd            byte    %8.0g      lbsex      Donor gender
distd           int     %10.0g                Donor to treatment center distance
lungpo2         double  %10.0g                Lung PO2
lungalloc       byte    %8.0g      lballo     Lung allocation status
hratio          double  %10.0g                Height ratio
ischemict       double  %10.0g                Ischemic time
genderm         byte    %19.0g     lbgm       Matching gender status
racem           byte    %17.0g     lbrm       Matching race status
transtype       byte    %8.0g      lbtau      Lung transplant type
fev1p           double  %10.0g                Percentage of predicted value of FEV1


In addition to our treatment and outcome variables, we have 29 variables that record characteristics of the patients and donors. To construct control variables, we want to use these 29 variables and the interactions among them. It would be tedious to type these variable names one by one to distinguish between continuous and categorical variables. vl is a suite of commands that simplifies this process. First, we use vl set to partition the variables into continuous and categorical variables automatically. The global macro $vlcategorical contains all the categorical variable names, and $vlcontinuous contains all the continuous variable names.

. quietly vl set

. display "$vlcategorical"
diabetesp karn racep sexp lifesvent assisvent o2rest raced smoked cmv deathcause diabetesd
>  expandd sexd lungalloc genderm racem transtype


Second, we use vl create to create customized variable lists. Specifically, $cvars contains all the continuous variables except the outcome (fev1p), and $fvars contains all the categorical variables except the treatment (transtype). Finally, vl sub defines the global macro $allvars as the continuous variables in $cvars, the categorical variables in $fvars, and the full set of interactions between each continuous variable and each categorical variable. We will use $allvars as the control variables for both the outcome model and the treatment model.

. vl create cvars = vlcontinuous - (fev1p)
note: $cvars initialized with 12 variables.

. vl create fvars = vlcategorical - (transtype)
note: $fvars initialized with 17 variables.

. vl sub allvars = c.cvars i.fvars c.cvars#i.fvars


## Dilemma: To include or not to include?

We have created the control variables, and we want to include all of them to estimate the treatment effect of the bilateral lung transplant versus the single lung transplant. The first question, however, is whether we can fit such a model. Let's try.

teffects is a Stata command that provides multiple estimators of treatment effects. We will use its augmented inverse-probability-weighting estimator, teffects aipw, and try to estimate the treatment effect with all the controls included.

. capture noisily teffects aipw (fev1p $allvars) (transtype $allvars)
Note: tmodel mlogit initial estimates did not converge; the model may not be identified
treatment 0 has 297 propensity scores less than 1.00e-05
treatment 1 has 205 propensity scores less than 1.00e-05
treatment overlap assumption has been violated; use option osample() to identify the
overlap violators


teffects produces an error complaining that the overlap assumption has been violated. The overlap assumption means that each patient has a strictly positive probability of being treated or not treated. In other words, given any patient in the treatment group, the overlap assumption implies that we can find a similar patient in the control group. That is, there is an overlap between the treatment and control groups.
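The overlap assumption can also be written compactly. The notation here is an assumption of this sketch and does not come from the article: $t_i$ is the treatment indicator (1 for BLT, 0 for SLT) and $\mathbf{x}_i$ is the vector of control variables for patient $i$.

```latex
0 < \Pr(t_i = 1 \mid \mathbf{x}_i) < 1 \quad \text{for every patient } i
```

The teffects output above reports estimated propensity scores below 1.00e-05 in both groups, which is what a practical failure of this condition looks like.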

In our example, including all of these controls violates the overlap assumption because some specific combination of values of the control variables appears in either the treatment group or the control group but not both. The more control variables there are, the more difficult it is to satisfy the overlap assumption.
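The error message itself points to the osample() option as a way to see which observations are responsible. A minimal sketch of how that could look follows; the variable name overlap_viol is hypothetical, and the sketch assumes that teffects gets far enough to create the indicator before it exits with the error.

```stata
* Sketch only: overlap_viol is a hypothetical variable name chosen for this example.
* osample() asks teffects to create an indicator for overlap violators.
capture noisily teffects aipw (fev1p $allvars) (transtype $allvars), osample(overlap_viol)

* Count the flagged observations within each transplant type.
tabulate overlap_viol transtype
```

A large share of flagged observations would indicate that the problem comes from the sheer number of controls and not from a handful of unusual patients.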

The dilemma is that including all the controls makes the model inestimable, but not including all of them renders our model too simple to approximate the reality.

## telasso: Select variables that matter

Now we try to fit the same model as above using telasso. As the name indicates, telasso is a combination of teffects and lasso. So we are using lasso to select variables in the treatment and outcome models and using the selected variables in the treatment-effects estimation.

We assume a linear outcome model and a logit treatment model. We type

. telasso (fev1p $allvars) (transtype $allvars)

Estimating lasso for outcome fev1p if tran~e = 0 using plugin method ...
Estimating lasso for outcome fev1p if tran~e = 1 using plugin method ...
Estimating lasso for treatment tran~e using plugin method ...
Estimating ATE ...

Treatment-effects lasso estimation    Number of observations      =        937
Outcome model:   linear               Number of controls          =        454
Treatment model: logit                Number of selected controls =          8

Robust
fev1p   Coefficient  std. err.      z    P>|z|     [95% conf. interval]

ATE
transtype
(BLT vs SLT)      37.51841   .1606703   233.51   0.000     37.20351    37.83332

POmean
transtype
SLT       46.4938   .2021582   229.99   0.000     46.09757    46.89002



In contrast to the teffects results in the above section, telasso can estimate the treatment effects when we include all the controls. The difference is that telasso selects only 8 variables among the 454 control variables. So telasso selects only variables that matter.
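To see what each lasso selected, the lasso postestimation tools can be run right after telasso. The following is a sketch under assumptions and not output from the article: it assumes that lassoinfo and lassocoef accept the for() and tlevel() options after telasso as written here, which is worth checking against [TE] telasso postestimation.

```stata
* Sketch only: run immediately after the telasso command above.
* Summarize the three lassos (outcome for SLT, outcome for BLT, treatment).
lassoinfo

* List the controls selected by each lasso side by side.
* Assumption: for() names the dependent variable and tlevel() the treatment level.
lassocoef (., for(fev1p) tlevel(0)) (., for(fev1p) tlevel(1)) (., for(transtype))
```

Because the outcome model is fit separately for each treatment level, the selected controls can differ between the SLT lasso, the BLT lasso, and the treatment lasso.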

More importantly, the estimator implemented in telasso is robust to the model selection mistakes made by lasso. Thus, the estimation results are still valid even if some important variables are not included in the eight selected variables or if some extra variables are included in them.
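For intuition, it may help to write out the standard AIPW estimator of the ATE, which is the estimator requested from teffects aipw earlier. The notation is an assumption of this sketch: $\hat\mu_t(\mathbf{x})$ is the fitted outcome model for treatment level $t$, $\hat p(\mathbf{x})$ is the fitted probability of receiving BLT, and $t_i$ equals 1 for BLT and 0 for SLT. See [TE] telasso for the exact estimator that telasso implements.

```latex
\widehat{\mathrm{ATE}}
  = \frac{1}{N}\sum_{i=1}^{N}\left[
      \hat\mu_1(\mathbf{x}_i) - \hat\mu_0(\mathbf{x}_i)
      + \frac{t_i\,\{y_i - \hat\mu_1(\mathbf{x}_i)\}}{\hat p(\mathbf{x}_i)}
      - \frac{(1 - t_i)\,\{y_i - \hat\mu_0(\mathbf{x}_i)\}}{1 - \hat p(\mathbf{x}_i)}
    \right]
```

The last two terms reweight the outcome-model residuals by the inverse propensity score. Estimators of this form remain consistent when either the outcome model or the treatment model is correctly specified, which is the kind of property that makes modest selection mistakes in one model less damaging.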

The estimation results can be interpreted as usual. If all the patients were to choose BLT, the average FEV1% is expected to be about 38 percentage points higher than the 46% average expected if all patients were to choose SLT.
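Because the ATE is the difference between the two potential-outcome means, the implied mean under BLT follows directly from the two reported estimates. The label $\mathrm{POmean}_{\mathrm{BLT}}$ is ours; telasso reports only the SLT mean in the output above.

```latex
\mathrm{POmean}_{\mathrm{BLT}}
  = \mathrm{POmean}_{\mathrm{SLT}} + \mathrm{ATE}
  = 46.4938 + 37.51841
  = 84.01221
```

That is, the expected FEV1% is about 84% if every patient received BLT, compared with about 46% if every patient received SLT.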

## Double machine learning

The estimates obtained above relied on a critical assumption of lasso, the sparsity assumption, which requires that only a small number of the potential covariates are in the “true” model. We can use a double machine learning technique to allow for more covariates in the true model. To do this, we add the xfolds(5) option to split the sample into five groups and perform cross-fitting, and we add the resample(3) option to repeat the cross-fitting procedure with three different random splits of the sample.

To guarantee that we can later reproduce the estimation results, we also set the random-number seed. We type

. set seed 12345671

. telasso (fev1p $allvars) (transtype $allvars), xfolds(5) resample(3) nolog

Treatment-effects lasso estimation    Number of observations       =       937
Number of controls           =       454
Number of selected controls  =        16
Outcome model:   linear               Number of folds in cross-fit =         5
Treatment model: logit                Number of resamples          =         3

Robust
fev1p   Coefficient  std. err.      z    P>|z|     [95% conf. interval]

ATE
transtype
(BLT vs SLT)      37.52837   .1683194   222.96   0.000     37.19847    37.85827

POmean
transtype
SLT       46.4941   .2040454   227.86   0.000     46.09418    46.89402



The estimated treatment effect is similar to the one reported by the first telasso command, but the selected model includes 16 controls instead of 8. The similarity of the estimates across the two specifications suggests that our first model did not violate the sparsity assumption.

## Concluding remarks

I showed the conflict that researchers face when estimating treatment effects with many control variables and how to use telasso to resolve it. To learn more about estimating treatment effects using lasso, see [TE] telasso.

## Reference

Koch, B., D. M. Vock, and J. Wolfson. 2018. Covariate selection with group lasso and doubly robust estimation of causal effects. Biometrics 74: 8–17. https://doi.org/10.1111/biom.12736

— by Di Liu
Senior Econometrician and Software Developer