Lasso for Cox proportional hazards models

Order

Watch video demo

<- See Stata 18's new features

Highlights

Estimation: lasso and elastic net
Left-truncation and right-censoring
Selection methods
- Cross-validation
- Adaptive lasso
- BIC
- User specified
Penalized and postselection predictions
Plots of survivor and other functions

If you have time-to-event data, also known as survival-time or failure-time data, and many predictors, check out the lasso cox and elasticnet cox commands. (And when we say many predictors, we mean hundreds, thousands, or more!) New in Stata 18, these commands expand the existing lasso suite for prediction and model selection to include a high-dimensional semiparametric Cox proportional hazards model.

After lasso cox and elasticnet cox, you can use stcurve to plot the survivor, failure, hazard, or cumulative hazard function or use any of the other postestimation tools available after lasso and elasticnet.

Let's see it work

We illustrate lasso cox with an example that predicts the risk of death for stage I lung adenocarcinoma patients. Lung adenocarcinoma is one of the most common non-small cell lung cancers.

Stage I adenocarcinoma indicates that the tumor size is relatively small and the cancer has not spread to other distant organs. Stage I adenocarcinoma patients have varied survival outcomes even though they are in the early cancer-development stage. For example, Yu et al. (2016) show that, in one cohort, more than 50% of stage I adenocarcinoma patients died within 5 years after the initial diagnosis, while about 15% of the patients survived for more than 10 years.

Histopathology image features are used for prognostic analysis. We can use lasso cox to extract the top histopathology image features that distinguish short-term and long-term survivors.

We have a fictional survival dataset (lungcancer.dta) inspired by Yu et al. (2016). The variable t records either the time of death or censoring in months for stage I adenocarcinoma lung cancer patients. The indicator variable died is 1 or 0 if the patient died or is censored, respectively. There are 500 histopathology image features, histfeature1 to hisfeature500, and only 250 patients. The analysis aims to classify a new patient into a low-risk or high-risk group, given the histopathology image features.

We first load the dataset and then type stset to show that it has already been stset.

. webuse lungcancer
(Fictitious data on stage I adenocarcinoma lung cancer)

. stset
-> stset t, failure(died)

Survival-time data settings

         Failure event: died!=0 & died<.
Observed time interval: (0, t]
     Exit on or before: failure



        250  total observations
          0  exclusions

        250  observations remaining, representing
        211  failures in single-record/single-failure data
 18,465.093  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =       260

Next, we split the full sample into training and testing data. The training data will be used for estimation, and the testing data will be used to measure the prediction performance.

We use splitsample to split the data into two parts. We use the generate(group) option to create a new variable, group, that equals 1 if it belongs to the training data or 0 to the testing data. The split(0.6 0.4) option specifies that 60% of the entire data are used as training data and 40% as testing data. To make the results reproducible, we specify the rseed() option.

. splitsample, generate(group) split(0.6 0.4) rseed(12345)

For later use, we save the training data as lungcancer_training.dta and the testing data as lungcancer_testing.dta.

. preserve

. keep if group == 1
(100 observations deleted)

. save lungcancer_training, replace
file lungcancer_training.dta not found
file lungcancer_training.dta saved

. restore

. preserve

. keep if group == 2
(150 observations deleted)

. save lungcancer_testing, replace
file lungcancer_testing.dta not found
file lungcancer_testing.dta saved

. restore

We now fit a lasso cox model using only the training data. By default, we use cross-validation. We specify rseed() for reproducibility.

. use lungcancer_training, clear
(Fictitious data on stage I adenocarcinoma lung cancer)

. lasso cox histfeature*, rseed(12345671)

        Failure _d: died
  Analysis time _t: t

10-fold cross-validation with 100 lambdas ...
(output omitted)
... cross-validation complete ... minimum found

Lasso Cox model                             No. of obs        =        150
                                            No. of covariates =        500
Selection: Cross-validation                 No. of CV folds   =         10



                                          No. of
                                         nonzero    In-sample      CV mean
      ID       Description      lambda     coef.   dev. ratio     deviance

       1      first lambda    .3539123         0       0.0000     8.922501
      30     lambda before    .0918411        45       0.2199     8.042941
    * 31   selected lambda    .0876668        48       0.2306     8.039609
      32      lambda after    .0836822        52       0.2419      8.05246
      34       last lambda    .0762481        63       0.2662     8.105045

* lambda selected by cross-validation.

lasso cox selects 48 of the 500 features. We can now predict the penalized relative-hazard ratio (variable riskscore_training) and evaluate risk scores. We will use the median of riskscore_training as a threshold to classify a patient as low risk or high risk. We store the median value in a global macro (median) for later use.

. predict riskscore_training
(options hr penalized assumed; predicted hazard ratio with penalized
coefficients)

. summarize riskscore_training, detail


              Predicted hazard ratio, penalized

      Percentiles      Smallest
 1%      .054982       .0414753
 5%     .0838301        .054982
10%     .1308778       .0702972       Obs                 150
25%     .3676802       .0727958       Sum of wgt.         150
 
50%     .9458244                      Mean           1.998198
                        Largest       Std. dev.       3.75226
75%     2.368032       9.962103
90%     4.912702       11.13334       Variance       14.07945
95%     6.651043        12.4411       Skewness       7.054249
99%      12.4411       39.40631       Kurtosis       67.68195


. global median = r(p50)

We now use the testing data to validate the model. First, we predict the penalized hazard ratio (variable riskscore_testing) in the testing sample. Then, we compare riskscore_testing with the median of the hazard ratio obtained in the training data ($median). The patient is labeled high risk if the predicted risk score is greater than or equal to the median. The patient is classified as low risk if the predicted risk score is less than the median.

. use lungcancer_testing, clear
(Fictitious data on stage I adenocarcinoma lung cancer)

. predict riskscore_testing
(options hr penalized assumed; predicted hazard ratio with penalized
coefficients)

. generate byte risk = (riskscore_testing >= $median)

. label define risk_lb 1 "High risk" 0 "Low risk"

. label values risk risk_lb

To evaluate the effectiveness of risk classification, we first look at the Kaplan–Meier plot, which plots the survival curve for both low-risk and high-risk groups.

. sts graph, by(risk)

The graph shows that the predicted high-risk patients have a more steeply falling survival curve than the predicted low-risk patients. To confirm this conjecture, we do a log-rank test.

. sts test risk

        Failure _d: died
  Analysis time _t: t

Equality of survivor functions
Log-rank test



             Observed       Expected
risk           events         events

 Low risk          39          68.17
High risk          51          21.83

    Total          90          90.00


                    chi2(1) =  61.50
                    Pr>chi2 = 0.0000

The log-rank test rejects the hypothesis that the predicted low-risk and high-risk patients have the same survival functions. Both the Kaplan–Meier plot and the log-rank test show that using the predicted hazard ratios' median can effectively distinguish a low-risk patient from a high-risk patient. We can now make prognostic predictions given new data.

The dataset (newlungcancer.dta) contains histopathology image features for some new stage I adenocarcinoma patients, and we do not observe their survival time yet because they are still alive. Based on the prediction model from lasso cox, we want to classify these new patients as low risk or high risk. To achieve this objective, we need to predict only the new patients' hazard ratios and compare them with the median level of risk score obtained in the training data.

. webuse newlungcancer, clear
(Fictitious new data on stage I adenocarcinoma lung cancer)

. predict riskscore_new
(options hr penalized assumed; predicted hazard ratio with penalized
coefficients)

. generate risk = (riskscore_new >= $median)

. label define risk_lb 1 "High risk" 0 "Low risk"

. label values risk risk_lb

. tabulate risk



       risk        Freq.     Percent        Cum.

   Low risk           27       54.00       54.00
  High risk           23       46.00      100.00

      Total           50      100.00

The table of the predicted risk level shows that 27 patients are classified as low risk, while 23 patients are classified as high risk.

Reference

Yu, K., C. Zhang, G. J. Berry, R. B. Altman, C. Ré, D. L. Rubin, and M. Snyder. 2016. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nature Communications 7(12474).

Tell me more

Learn more about other new features in survival analysis.

Learn more about other lasso features in Lasso for prediction and model selection.

View all the new features in Stata 18.

Made for data science.

Get started today.

Order

Upgrade

2024 Stata Conference · 1-2 August · Portland, OR

View the program →

View the program →