<- See Stata 18's new features
Highlights
Estimation: lasso and elastic net
Left-truncation and right-censoring
Selection methods
Cross-validation
Adaptive lasso
BIC
User specified
Penalized and postselection predictions
Plots of survivor and other functions
If you have time-to-event data, also known as survival-time or failure-time data, and many predictors, check out the lasso cox and elasticnet cox commands. (And when we say many predictors, we mean hundreds, thousands, or more!) New in Stata 18, these commands expand the existing lasso suite for prediction and model selection to include a high-dimensional semiparametric Cox proportional hazards model.
After lasso cox and elasticnet cox, you can use stcurve to plot the survivor, failure, hazard, or cumulative hazard function or use any of the other postestimation tools available after lasso and elasticnet.
We illustrate lasso cox with an example that predicts the risk of death for stage I lung adenocarcinoma patients. Lung adenocarcinoma is one of the most common non-small cell lung cancers.
Stage I adenocarcinoma indicates that the tumor size is relatively small and the cancer has not spread to other distant organs. Stage I adenocarcinoma patients have varied survival outcomes even though they are in the early cancer-development stage. For example, Yu et al. (2016) show that, in one cohort, more than 50% of stage I adenocarcinoma patients died within 5 years after the initial diagnosis, while about 15% of the patients survived for more than 10 years.
Histopathology image features are used for prognostic analysis. We can use lasso cox to extract the top histopathology image features that distinguish short-term and long-term survivors.
We have a fictional survival dataset (lungcancer.dta) inspired by Yu et al. (2016). The variable t records either the time of death or censoring in months for stage I adenocarcinoma lung cancer patients. The indicator variable died is 1 or 0 if the patient died or is censored, respectively. There are 500 histopathology image features, histfeature1 to hisfeature500, and only 250 patients. The analysis aims to classify a new patient into a low-risk or high-risk group, given the histopathology image features.
We first load the dataset and then type stset to show that it has already been stset.
. webuse lungcancer (Fictitious data on stage I adenocarcinoma lung cancer) . stset -> stset t, failure(died) Survival-time data settings Failure event: died!=0 & died<. Observed time interval: (0, t] Exit on or before: failure
250 total observations |
0 exclusions |
250 observations remaining, representing |
211 failures in single-record/single-failure data |
18,465.093 total analysis time at risk and under observation |
At risk from t = 0 |
Earliest observed entry t = 0 |
Last observed exit t = 260 |
Next, we split the full sample into training and testing data. The training data will be used for estimation, and the testing data will be used to measure the prediction performance.
We use splitsample to split the data into two parts. We use the generate(group) option to create a new variable, group, that equals 1 if it belongs to the training data or 0 to the testing data. The split(0.6 0.4) option specifies that 60% of the entire data are used as training data and 40% as testing data. To make the results reproducible, we specify the rseed() option.
. splitsample, generate(group) split(0.6 0.4) rseed(12345)
For later use, we save the training data as lungcancer_training.dta and the testing data as lungcancer_testing.dta.
. preserve . keep if group == 1 (100 observations deleted) . save lungcancer_training, replace file lungcancer_training.dta not found file lungcancer_training.dta saved . restore . preserve . keep if group == 2 (150 observations deleted) . save lungcancer_testing, replace file lungcancer_testing.dta not found file lungcancer_testing.dta saved . restore
We now fit a lasso cox model using only the training data. By default, we use cross-validation. We specify rseed() for reproducibility.
. use lungcancer_training, clear (Fictitious data on stage I adenocarcinoma lung cancer) . lasso cox histfeature*, rseed(12345671) Failure _d: died Analysis time _t: t 10-fold cross-validation with 100 lambdas ... (output omitted) ... cross-validation complete ... minimum found Lasso Cox model No. of obs = 150 No. of covariates = 500 Selection: Cross-validation No. of CV folds = 10
No. of | ||
nonzero In-sample CV mean | ||
ID | Description lambda coef. dev. ratio deviance | |
1 | first lambda .3539123 0 0.0000 8.922501 | |
30 | lambda before .0918411 45 0.2199 8.042941 | |
* 31 | selected lambda .0876668 48 0.2306 8.039609 | |
32 | lambda after .0836822 52 0.2419 8.05246 | |
34 | last lambda .0762481 63 0.2662 8.105045 | |
lasso cox selects 48 of the 500 features. We can now predict the penalized relative-hazard ratio (variable riskscore_training) and evaluate risk scores. We will use the median of riskscore_training as a threshold to classify a patient as low risk or high risk. We store the median value in a global macro (median) for later use.
. predict riskscore_training (options hr penalized assumed; predicted hazard ratio with penalized coefficients) . summarize riskscore_training, detail
Predicted hazard ratio, penalized |
Percentiles Smallest |
1% .054982 .0414753 |
5% .0838301 .054982 |
10% .1308778 .0702972 Obs 150 |
25% .3676802 .0727958 Sum of wgt. 150 |
50% .9458244 Mean 1.998198 |
Largest Std. dev. 3.75226 |
75% 2.368032 9.962103 |
90% 4.912702 11.13334 Variance 14.07945 |
95% 6.651043 12.4411 Skewness 7.054249 |
99% 12.4411 39.40631 Kurtosis 67.68195 |
We now use the testing data to validate the model. First, we predict the penalized hazard ratio (variable riskscore_testing) in the testing sample. Then, we compare riskscore_testing with the median of the hazard ratio obtained in the training data ($median). The patient is labeled high risk if the predicted risk score is greater than or equal to the median. The patient is classified as low risk if the predicted risk score is less than the median.
. use lungcancer_testing, clear (Fictitious data on stage I adenocarcinoma lung cancer) . predict riskscore_testing (options hr penalized assumed; predicted hazard ratio with penalized coefficients) . generate byte risk = (riskscore_testing >= $median) . label define risk_lb 1 "High risk" 0 "Low risk" . label values risk risk_lb
To evaluate the effectiveness of risk classification, we first look at the Kaplan–Meier plot, which plots the survival curve for both low-risk and high-risk groups.
. sts graph, by(risk)
The graph shows that the predicted high-risk patients have a more steeply falling survival curve than the predicted low-risk patients. To confirm this conjecture, we do a log-rank test.
. sts test risk Failure _d: died Analysis time _t: t Equality of survivor functions Log-rank test
Observed Expected | ||
risk | events events | |
Low risk | 39 68.17 | |
High risk | 51 21.83 | |
Total | 90 90.00 |
The log-rank test rejects the hypothesis that the predicted low-risk and high-risk patients have the same survival functions. Both the Kaplan–Meier plot and the log-rank test show that using the predicted hazard ratios' median can effectively distinguish a low-risk patient from a high-risk patient. We can now make prognostic predictions given new data.
The dataset (newlungcancer.dta) contains histopathology image features for some new stage I adenocarcinoma patients, and we do not observe their survival time yet because they are still alive. Based on the prediction model from lasso cox, we want to classify these new patients as low risk or high risk. To achieve this objective, we need to predict only the new patients' hazard ratios and compare them with the median level of risk score obtained in the training data.
. webuse newlungcancer, clear (Fictitious new data on stage I adenocarcinoma lung cancer) . predict riskscore_new (options hr penalized assumed; predicted hazard ratio with penalized coefficients) . generate risk = (riskscore_new >= $median) . label define risk_lb 1 "High risk" 0 "Low risk" . label values risk risk_lb . tabulate risk
risk | Freq. Percent Cum. | |
Low risk | 27 54.00 54.00 | |
High risk | 23 46.00 100.00 | |
Total | 50 100.00 |
The table of the predicted risk level shows that 27 patients are classified as low risk, while 23 patients are classified as high risk.
Yu, K., C. Zhang, G. J. Berry, R. B. Altman, C. Ré, D. L. Rubin, and M. Snyder. 2016. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nature Communications 7(12474).
Learn more about other new features in survival analysis.
Learn more about other lasso features in Lasso for prediction and model selection.
Read more in the Stata Lasso Reference Manual; see [LASSO] lasso and [LASSO] elasticnet.
View all the new features in Stata 18.