»  Home »  Products »  Features »  Bayesian lasso

## Bayesian lasso

### Highlights

• Laplace prior for specifying L1 penalty for coefficients
• Gamma prior for the penalty hyperparameter
• Bayesian criteria for selecting important predictors

Stata has a suite of commands for performing lasso-based prediction and inference. But did you know that you can fit Bayesian lasso linear models by using bayes: regress? And that inference can be performed in the usual Bayesian way, just as with any other model?

Lasso uses L1-constrained least squares to estimate its coefficients. This involves selecting a value for the penalty parameter, which is often done using cross-validation.

The Bayesian approach is different. Constraints are imposed through prior distributions. The Laplace prior provides constraints that have the same analytic form as the L1 penalty used in lasso. The Laplace prior introduces the penalty parameter as one more model parameter that needs to be estimated from the data. One advantage of the Bayesian approach is that the reported credible intervals can be used to obtain proper inference for the parameters of interest. Within the frequentist approach, special methods are needed to obtain proper inference; see Lasso for inference.

To fit a lasso-style model, use the bayes: prefix or the bayesmh command with Laplace priors. No variable selection is performed automatically, but Bayesian analysis offers various ways to select variables. One is to use bayesstats summary to display a table of the posterior probabilities that each coefficient is different from 0 and select variables based on them.

Let's look at an example. Consider the diabetes data (Efron et al. 2004) on 442 diabetes patients that record a measure of disease progression (one year after baseline) as an outcome y and 10 baseline covariates: age, sex, body mass index, mean arterial pressure, and 6 blood serum measurements. We consider the version of these data in which the covariates are standardized to have zero mean and the sum of squares across all observations of one.

We would like to fit a linear regression model to y and determine which variables are important for predicting y. Efron et al. (2004) applied the least-angle-regression variable-selection method to these data. Park and Casella (2008) introduced Bayesian lasso and used it to analyze these same data. Below, we follow their approach to demonstrate how this can be done using bayes: regress.

For comparison, let's first use the traditional lasso command to select the important predictors.

. lasso linear y age sex bmi map tc ldl hdl tch ltg glu, nolog rseed(16)

Lasso linear model                          No. of obs        =        442
No. of covariates =         10
Selection: Cross-validation                 No. of CV folds   =         10

No. of      Out-of-      CV mean
nonzero       sample   prediction
ID       Description      lambda     coef.    R-squared        error

1      first lambda    45.16003         0       0.0008     5934.909
47     lambda before    .6254151         8       0.4896     3026.453
* 48   selected lambda    .5698549         8       0.4896      3026.42
49      lambda after    .5192306         8       0.4896     3026.567
83       last lambda    .0219595        10       0.4891     3029.735

* lambda selected by cross-validation.


The estimated penalty parameter lambda for the selected model is 0.57. We can see the selected coefficients and their penalized estimates by typing

. lassocoef, display(coef, penalized)

active

sex   -213.4102
bmi    524.8137
map    306.6641
tc   -154.2327
hdl   -184.5127
tch    58.66197
ltg    523.1158
glu    60.12939
_cons    152.1335



We now use bayes: regress to fit a Bayesian linear lasso model as described by Park and Casella (2008).

Let's look at the command specification first.

. bayes, prior({y:age sex bmi map tc ldl hdl tch ltg glu}, laplace(0, (sqrt({sigma2}/{lam2}))))
prior({sigma2},  jeffreys)
prior({y:_cons}, normal(0, 1e6))
prior({lam2=1},  gamma(1, 1/1.78))
block({y:} {sigma2} {lam2}, split)
rseed(16) dots:
: regress y age sex bmi map tc ldl hdl tch ltg glu


For the coefficients, we use the Laplace prior with zero mean and the scale parameter that depends on the variance {sigma2} and squared penalization hyperparameter {lam2}. The authors suggest using the gamma prior with shape 1 and rate 1.78 (or scale 1/1.78) as the prior for {lam2}. For the intercept {y:_cons} and variance, we use vague priors: normal(0, 1e6) and jeffreys. To improve sampling efficiency for these data, we sample all parameters in separate blocks.

Let's now run the model.

. bayes, prior({y:age sex bmi map tc ldl hdl tch ltg glu}, laplace(0, (sqrt({sigma2}/{lam2}))))
prior({sigma2},  jeffreys)
prior({y:_cons}, normal(0, 1e6))
prior({lam2=1},  gamma(1, 1/1.78))
block({y:} {sigma2} {lam2}, split)
rseed(16) dots
: regress y age sex bmi map tc ldl hdl tch ltg glu

Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 10000 .........1000.........2000.........3000.........4000.........5
> 000.........6000.........7000.........8000.........9000.........10000 done

Model summary

Likelihood:
y ~ regress(xb_y,{sigma2})

Priors:
{y:age sex bmi map tc ldl hdl tch ltg glu} ~ laplace(0,)                 (1)
{y:_cons} ~ normal(0,1e6)               (1)
{sigma2} ~ jeffreys

Hyperprior:
{lam2} ~ gamma(1,1/1.78)

Expression:
expr1 : sqrt({sigma2}/{lam2})

(1) Parameters are elements of the linear form xb_y.

Bayesian linear regression                       MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
MCMC sample size =     10,000
Number of obs    =        442
Acceptance rate  =      .4379
Efficiency:  min =      .0152
avg =      .1025
Log marginal-likelihood = -2415.7171                          max =      .2299

Equal-tailed
Mean   Std. dev.     MCSE     Median  [95% cred. interval]

y
age   -2.478525   52.97851   1.26623  -2.593401  -108.3303   104.2442
sex   -209.4461   61.21979   1.70006  -211.1479  -330.2515  -85.52568
bmi    522.1367   66.76557    1.8115   520.6348   393.4224   656.4993
map    304.1617   65.26244   1.77912   306.1749   175.0365   432.3554
tc   -172.2847   157.5097   12.7739  -159.4816  -523.0447    110.226
ldl    1.304382   128.3598   9.96343  -7.796492  -251.2571   298.4382
hdl   -158.8146   112.6562   6.82563  -158.1347  -378.4126   48.93263
tch    91.27437   111.8483   6.06667   86.32462   -114.675   319.0824
ltg    515.5167   94.06607   5.83902   509.9952   342.9893    715.739
glu    67.94583   62.86024   1.69235   66.11433   -51.1174   197.7894
_cons    152.0964   2.545592   .053095   152.0963   146.9166   157.1342

sigma2    2961.246   207.0183   4.79372   2949.282   2587.023   3389.206
lam2    .0889046    .055257   .001899   .0769573    .020454    .229755



The coefficient estimates above are somewhat similar to the penalized estimates of the variables selected by lasso.

We can compare the estimates of the penalized parameters:

. bayesstats summary (lambda:sqrt({lam2}))

Posterior summary statistics                      MCMC sample size =    10,000

lambda : sqrt({lam2})

Equal-tailed
Mean   Std. dev.     MCSE     Median  [95% cred. interval]

lambda     .285446   .0861738   .003292   .2774119   .1430174   .4793277



Bayesian lasso estimated lambda to be 0.29, which is smaller than lasso's 0.57, but this estimate is not directly comparable because lasso standardizes covariates to have the scale of 1 during the computation.

Although Bayesian lasso does not automatically select variables, we can use some of Bayesian postestimation tools to help us select the variables. For instance, we can use bayesstats summary to compute the posterior probabilities that each coefficient is less than 0. If the probability is close to 0.5, then the coefficient estimate is centered around 0, and the corresponding variable should not be selected in the model.

. bayesstats summary (age:{y:age}<0) (sex:{y:sex}<0) (bmi:{y:bmi}<0)
(map:{y:map}<0) (tc:{y:tc}<0) (ldl:{y:ldl}<0) (hdl:{y:hdl}<0)
(tch:{y:tch}<0) (ltg:{y:ltg}<0) (glu:{y:glu}<0)

Posterior summary statistics                      MCMC sample size =    10,000

age : {y:age}<0
sex : {y:sex}<0
bmi : {y:bmi}<0
map : {y:map}<0
tc : {y:tc}<0
ldl : {y:ldl}<0
hdl : {y:hdl}<0
tch : {y:tch}<0
ltg : {y:ltg}<0
glu : {y:glu}<0

Equal-tailed
Mean   Std. dev.     MCSE     Median  [95% cred. interval]

age       .5277   .4992571   .011353          1          0          1
sex       .9997   .0173188   .000224          1          1          1
bmi           0          0         0          0          0          0
map           0          0         0          0          0          0
tc       .8815   .3232154   .018971          1          0          1
ldl       .5301   .4991181   .029122          1          0          1
hdl       .9226   .2672384    .01198          1          0          1
tch       .1992   .3994187   .016507          0          0          1
ltg       .1992   .3994187   .016507          0          0          1
glu       .1417   .3487596   .008577          0          0          1



There are two variables, age and ldl, for which the probability that their coefficients are less than 0 is close to 0.5. So, we would drop these two variables from the model. Thus, we arrive at the same set of variables as selected earlier by lasso.

## References

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani 2004. Least angle regression. The Annals of Statistics 32: 407–499.

Park, T.,and G. Casella. 2008. Bayesian lasso. Journal of the American Statistical Association 103: 681–686.

## Tell me more

See all Bayesian features and the Stata Bayesian Analysis Reference Manual.