

Highlights

  • Laplace prior for specifying L1 penalty for coefficients

  • Gamma prior for the penalty hyperparameter

  • Bayesian criteria for selecting important predictors

Stata has a suite of commands for performing lasso-based prediction and inference. But did you know that you can fit Bayesian lasso linear models by using bayes: regress? And that inference can be performed in the usual Bayesian way, just as with any other model?

Lasso uses L1-constrained least squares to estimate its coefficients. This involves selecting a value for the penalty parameter, which is often done using cross-validation.

The Bayesian approach is different. Constraints are imposed through prior distributions. The Laplace prior provides constraints that have the same analytic form as the L1 penalty used in lasso. The Laplace prior introduces the penalty parameter as one more model parameter that needs to be estimated from the data. One advantage of the Bayesian approach is that the reported credible intervals can be used to obtain proper inference for the parameters of interest. Within the frequentist approach, special methods are needed to obtain proper inference; see Lasso for inference.
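
The correspondence between the L1 penalty and the Laplace prior can be written out explicitly. The following sketch is the standard argument along the lines of Park and Casella (2008); it is not part of Stata's output. The lasso estimate solves

    \hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\; (y - X\beta)'(y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j|,

whereas the conditional Laplace prior used below is

    \pi(\beta \mid \sigma^2) = \prod_{j=1}^{p} \frac{\lambda}{2\sigma} \exp\!\left(-\frac{\lambda\,|\beta_j|}{\sigma}\right).

For fixed \sigma^2, minimizing the negative log posterior in \beta is equivalent to minimizing (y - X\beta)'(y - X\beta) + 2\sigma\lambda \sum_j |\beta_j|, so the conditional posterior mode of \beta is a lasso solution with penalty 2\sigma\lambda. The Laplace scale \sigma/\lambda = \sqrt{\sigma^2/\lambda^2} is exactly the expression supplied to the laplace() prior in the commands below.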

To fit a lasso-style model, use the bayes: prefix or the bayesmh command with Laplace priors. No variable selection is performed automatically, but Bayesian analysis offers various ways to select variables. One is to use bayesstats summary to display a table of the posterior probabilities that, say, each coefficient is less than 0 and to select variables based on these probabilities.
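
In schematic form, with a generic outcome y and covariates x1-x3, the specification looks as follows (a sketch only; the gamma(1, 1) hyperprior is illustrative, and the default priors for {y:_cons} and {sigma2} are left in place, whereas the worked example below specifies them explicitly):

. bayes, prior({y:x1 x2 x3}, laplace(0, (sqrt({sigma2}/{lam2}))))
     prior({lam2=1}, gamma(1, 1))
     : regress y x1 x2 x3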

Let's look at an example. Consider the diabetes data (Efron et al. 2004) on 442 diabetes patients that record a measure of disease progression one year after baseline as the outcome y and 10 baseline covariates: age, sex, body mass index, mean arterial pressure, and 6 blood serum measurements. We consider the version of these data in which the covariates are standardized to have mean zero and unit sum of squares across all observations.

We would like to fit a linear regression model to y and determine which variables are important for predicting y. Efron et al. (2004) applied the least-angle-regression variable-selection method to these data. Park and Casella (2008) introduced Bayesian lasso and used it to analyze these same data. Below, we follow their approach to demonstrate how this can be done using bayes: regress.

For comparison, let's first use the traditional lasso command to select the important predictors.

. lasso linear y age sex bmi map tc ldl hdl tch ltg glu, nolog rseed(16)

Lasso linear model                          No. of obs        =        442
                                            No. of covariates =         10
Selection: Cross-validation                 No. of CV folds   =         10

-------------------------------------------------------------------------
         |                                No. of   Out-of-      CV mean
         |                               nonzero    sample   prediction
      ID |      Description     lambda     coef.  R-squared       error
---------+---------------------------------------------------------------
       1 |     first lambda   45.16003        0      0.0008     5934.909
      47 |    lambda before   .6254151        8      0.4896     3026.453
    * 48 |  selected lambda   .5698549        8      0.4896      3026.42
      49 |     lambda after   .5192306        8      0.4896     3026.567
      83 |      last lambda   .0219595       10      0.4891     3029.735
-------------------------------------------------------------------------
* lambda selected by cross-validation.

The estimated penalty parameter lambda for the selected model is 0.57. We can see the selected coefficients and their penalized estimates by typing

. lassocoef, display(coef, penalized)

------------------------
             |    active
-------------+----------
         sex | -213.4102
         bmi |  524.8137
         map |  306.6641
          tc | -154.2327
         hdl | -184.5127
         tch |  58.66197
         ltg |  523.1158
         glu |  60.12939
       _cons |  152.1335
------------------------

We now use bayes: regress to fit a Bayesian linear lasso model as described by Park and Casella (2008).

Let's look at the command specification first.

. bayes, prior({y:age sex bmi map tc ldl hdl tch ltg glu}, laplace(0, (sqrt({sigma2}/{lam2}))))
     prior({sigma2},  jeffreys)
     prior({y:_cons}, normal(0, 1e6))
     prior({lam2=1},  gamma(1, 1/1.78))
     block({y:} {sigma2} {lam2}, split)
     rseed(16) dots
     : regress y age sex bmi map tc ldl hdl tch ltg glu

For the coefficients, we use the Laplace prior with mean zero and a scale parameter that depends on the variance {sigma2} and the squared penalization hyperparameter {lam2}. The authors suggest using the gamma prior with shape 1 and rate 1.78 (that is, scale 1/1.78) as the prior for {lam2}. For the intercept {y:_cons} and the variance, we use vague priors: normal(0, 1e6) and jeffreys. To improve sampling efficiency for these data, we sample all parameters in separate blocks.
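
As mentioned earlier, the same model can be fit with bayesmh directly instead of the bayes: prefix. Below is a sketch of an equivalent bayesmh specification with the same priors; it is not run on this page, and its results may differ slightly from those of the bayes: run below because of sampling differences.

. bayesmh y age sex bmi map tc ldl hdl tch ltg glu, likelihood(normal({sigma2}))
     prior({y:age sex bmi map tc ldl hdl tch ltg glu}, laplace(0, (sqrt({sigma2}/{lam2}))))
     prior({sigma2},  jeffreys)
     prior({y:_cons}, normal(0, 1e6))
     prior({lam2},    gamma(1, 1/1.78))
     initial({lam2} 1)
     block({y:} {sigma2} {lam2}, split)
     rseed(16) dots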

Let's now run the model.

. bayes, prior({y:age sex bmi map tc ldl hdl tch ltg glu}, laplace(0, (sqrt({sigma2}/{lam2}))))
     prior({sigma2},  jeffreys)
     prior({y:_cons}, normal(0, 1e6))
     prior({lam2=1},  gamma(1, 1/1.78))
     block({y:} {sigma2} {lam2}, split)
     rseed(16) dots
     : regress y age sex bmi map tc ldl hdl tch ltg glu

Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 10000 .........1000.........2000.........3000.........4000.........5
> 000.........6000.........7000.........8000.........9000.........10000 done

Model summary
------------------------------------------------------------------------------
Likelihood:
  y ~ regress(xb_y,{sigma2})

Priors:
  {y:age sex bmi map tc ldl hdl tch ltg glu} ~ laplace(0,<expr1>)          (1)
                                   {y:_cons} ~ normal(0,1e6)               (1)
                                    {sigma2} ~ jeffreys

Hyperprior:
  {lam2} ~ gamma(1,1/1.78)

Expression:
  expr1 : sqrt({sigma2}/{lam2})
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_y.

Bayesian linear regression                       MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =        442
                                                 Acceptance rate  =      .4379
                                                 Efficiency:  min =      .0152
                                                              avg =      .1025
Log marginal-likelihood = -2415.7171                          max =      .2299

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
y            |
         age | -2.478525   52.97851    1.26623  -2.593401  -108.3303   104.2442
         sex | -209.4461   61.21979    1.70006  -211.1479  -330.2515  -85.52568
         bmi |  522.1367   66.76557     1.8115   520.6348   393.4224   656.4993
         map |  304.1617   65.26244    1.77912   306.1749   175.0365   432.3554
          tc | -172.2847   157.5097    12.7739  -159.4816  -523.0447    110.226
         ldl |  1.304382   128.3598    9.96343  -7.796492  -251.2571   298.4382
         hdl | -158.8146   112.6562    6.82563  -158.1347  -378.4126   48.93263
         tch |  91.27437   111.8483    6.06667   86.32462   -114.675   319.0824
         ltg |  515.5167   94.06607    5.83902   509.9952   342.9893    715.739
         glu |  67.94583   62.86024    1.69235   66.11433   -51.1174   197.7894
       _cons |  152.0964   2.545592    .053095   152.0963   146.9166   157.1342
-------------+----------------------------------------------------------------
      sigma2 |  2961.246   207.0183    4.79372   2949.282   2587.023   3389.206
        lam2 |  .0889046    .055257    .001899   .0769573    .020454    .229755
------------------------------------------------------------------------------

The coefficient estimates above are somewhat similar to the penalized estimates of the variables selected by lasso.

We can also compare the two estimates of the penalty parameter lambda:

. bayesstats summary (lambda:sqrt({lam2}))

Posterior summary statistics                      MCMC sample size =    10,000

      lambda : sqrt({lam2})

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
      lambda |   .285446   .0861738    .003292   .2774119   .1430174   .4793277
------------------------------------------------------------------------------

Bayesian lasso estimated lambda to be 0.29, which is smaller than lasso's 0.57, but the two estimates are not directly comparable because lasso standardizes the covariates to have unit scale during the computation.

Although Bayesian lasso does not automatically select variables, we can use some of the Bayesian postestimation tools to help us select them. For instance, we can use bayesstats summary to compute the posterior probability that each coefficient is less than 0. If this probability is close to 0.5, the posterior distribution of the coefficient is centered around 0, and the corresponding variable should not be selected in the model.

. bayesstats summary (age:{y:age}<0) (sex:{y:sex}<0) (bmi:{y:bmi}<0)
     (map:{y:map}<0) (tc:{y:tc}<0) (ldl:{y:ldl}<0) (hdl:{y:hdl}<0)
     (tch:{y:tch}<0) (ltg:{y:ltg}<0) (glu:{y:glu}<0)


Posterior summary statistics                      MCMC sample size =    10,000

         age : {y:age}<0
         sex : {y:sex}<0
         bmi : {y:bmi}<0
         map : {y:map}<0
          tc : {y:tc}<0
         ldl : {y:ldl}<0
         hdl : {y:hdl}<0
         tch : {y:tch}<0
         ltg : {y:ltg}<0
         glu : {y:glu}<0

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
         age |     .5277   .4992571    .011353          1          0          1
         sex |     .9997   .0173188    .000224          1          1          1
         bmi |         0          0          0          0          0          0
         map |         0          0          0          0          0          0
          tc |     .8815   .3232154    .018971          1          0          1
         ldl |     .5301   .4991181    .029122          1          0          1
         hdl |     .9226   .2672384     .01198          1          0          1
         tch |     .1992   .3994187    .016507          0          0          1
         ltg |     .1992   .3994187    .016507          0          0          1
         glu |     .1417   .3487596    .008577          0          0          1
------------------------------------------------------------------------------

There are two variables, age and ldl, for which the probability that the coefficient is less than 0 is close to 0.5, so we would drop these two variables from the model. We thus arrive at the same set of variables that lasso selected earlier.
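
If we wanted to continue the analysis with this reduced model, we could refit it with the same priors, for example (a sketch; results are not shown on this page):

. bayes, prior({y:sex bmi map tc hdl tch ltg glu}, laplace(0, (sqrt({sigma2}/{lam2}))))
     prior({sigma2},  jeffreys)
     prior({y:_cons}, normal(0, 1e6))
     prior({lam2=1},  gamma(1, 1/1.78))
     block({y:} {sigma2} {lam2}, split)
     rseed(16) dots
     : regress y sex bmi map tc hdl tch ltg glu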

References

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani. 2004. Least angle regression. The Annals of Statistics 32: 407–499.

Park, T., and G. Casella. 2008. The Bayesian lasso. Journal of the American Statistical Association 103: 681–686.