Home  /  Products  /  Features  /  Interval-censored Cox model

<-  See Stata's other features

Highlights

  • Genuine semiparametric Cox proportional hazards modeling

  • Left-censoring, right-censoring, interval-censoring

  • Single-record and multiple-record data format

  • Acommodation with time-varying covariates

  • Stratified estimation

  • Four methods to estimate standard errors, including robust and cluster–robust standard errors

  • Testing proportional-hazards assumption

  • Predictions of baseline functions, Martingale-like residuals, Cox–Snell-like residuals, and time-varying prediction

  • Graphs of survivor, cumulative hazard, and hazard functions

  • Graphical checks of proportional-hazards assumption

What is the exact time of cancer recurrence? Or time of COVID-19 infection? We don't really know exactly. All we know is that the cancer recurred sometime between two follow-up examinations and that the infection occurred sometime before the onset of symptoms or a positive COVID test. Data like these are called interval-censored time-to-event data. By time-to-event or event-time data, we also mean failure-time data, duration data, and so on. They can be stored in two different formats: single-record-per-subject format and multiple-record-per-subject format.

Interval-censored event-time data arise in many areas, including medical, epidemiological, economic, financial, and sociological studies. Ignoring interval-censoring will often lead to biased estimates.

A semiparametric Cox proportional hazards regression model is used routinely to analyze uncensored and right-censored event-time data. In Stata, you can use the estimation command stintcox to fit the Cox model to interval-censored event-time data, for both single-record and multiple-record formats.

Just as with right-censored data, a Cox model is appealing for interval-censored data because it uses semiparametric estimation and thus does not require any parametric assumptions about the baseline hazard function. Also for low-event rates, the exponentiated regression parameters approximate the log relative-risks.

Semiparametric estimation of interval-censored event-time data is challenging because none of the event times are observed exactly. Thus, "semiparametric" modeling of these data often resorted to using spline methods or piecewise-exponential models for the baseline hazard function. Genuine semiparametric modeling of interval-censored event-time data was not available until recent methodological advances, which are implemented in the stintcox command.

Let's see it work

Fitting a model for single-record-per-subject interval-censored data

We use data from Zeng et al. (2016) that studies time to HIV infection in a cohort study of injecting drug users in Thailand. The dataset contains 1,124 subjects. Those subjects initially tested negative for the HIV virus. They were then followed and assessed for HIV-1 seropositivity through blood tests approximately every four months. Because the subjects were tested periodically, the exact time of HIV infection was not observed, but the times are known to fall in intervals between blood tests with the lower and upper endpoints recorded in variables ltime and rtime.

The dataset also records several baseline factors, such as centered age at recruitment (age_mean), sex (male), history of needle sharing (needle), history of drug injection (inject), and whether a subject has been in jail at the time of recruitment (jail). The dataset contains one record per subject with the event-time information recorded as interval data and is called single-record-per-subject data. Here is a subset of the dataset for subjects 271–274:

. use https://www.stata-press.com/data/r18/idu
(Modified Bangkok IDU Preparatory Study)

. format ltime rtime age_mean %6.2f

. list ltime rtime age_mean male needle inject jail if _n >=271
     & _n<=274

ltime rtime age_mean male needle inject jail
271. 22.00 . -6.46 Yes Yes No No
272. 3.80 9.41 8.54 No No No Yes
273. 20.66 . -11.46 Yes Yes No No
274. 0.00 3.87 -4.46 Yes Yes Yes Yes

We can fit a Cox proportional hazards model in which time to HIV infection depends on above factors of interest. In a single-record-per-subject format, we must specify option interval() with stintcox.

. stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime)
note: using adaptive step size to compute derivatives.

Performing EM optimization (showing every 100 iterations):
Iteration 0:    Log likelihood = -1086.2564
Iteration 100:  Log likelihood = -597.65634
Iteration 200:  Log likelihood = -597.57555
Iteration 295:  Log likelihood = -597.56443

Computing standard errors: ............................ done

Interval-censored Cox regression                    Number of obs     =  1,124
Baseline hazard: Reduced intervals                         Uncensored =      0
                                                        Left-censored =     41
                                                       Right-censored =    991
                                                       Interval-cens. =     92

                                                    Wald chi2(5)      =  17.10
Log likelihood = -597.56443                         Prob > chi2       = 0.0043

OPG
Haz. ratio std. err. z P>|z| [95% conf. interval]
age_mean .9684341 .0126552 -2.45 0.014 .9439452 .9935582
male
Yes .6846949 .1855907 -1.40 0.162 .4025073 1.164717
needle
Yes 1.275912 .2279038 1.36 0.173 .8990401 1.810768
inject
Yes 1.250154 .2414221 1.16 0.248 .8562184 1.825334
jail
Yes 1.567244 .3473972 2.03 0.043 1.014982 2.419998
Note: Standard-error estimates may be more variable for small datasets and datasets with low proportions of interval-censored observations.

We find that age at the recruitment is significantly associated with lower risk of HIV infection and being in jail at enrollment is associated with higher risk of HIV infection. Other factors are not statistically significant.

Stratification

If we assume that the baseline hazard function for female is different from that for male, we can fit a stratified Cox proportional hazards model by using the strata() option:

. stintcox age_mean i.needle i.inject i.jail, interval(ltime rtime) strata(male)
note: using adaptive step size to compute derivatives.

Performing EM optimization (showing every 100 iterations):
Iteration 0:   Log likelihood = -1087.0536
Iteration 100: Log likelihood = -585.59848
Iteration 200: Log likelihood = -585.53143
Iteration 282: Log likelihood =  -585.5222

Computing standard errors: ........................... done

Stratified interval-censored Cox regression         Number of obs     =  1,124
Baseline hazard: Reduced intervals                         Uncensored =      0
Strata variable: male                                   Left-censored =     41
                                                       Right-censored =    991
Event-time interval:                                   Interval-cens. =     92
  Lower endpoint: ltime
  Upper endpoint: rtime
                                                    Wald chi2(4)      =  14.84
Log likelihood = -585.5222                          Prob > chi2       = 0.0051

OPG
Haz. ratio std. err. z P>|z| [95% conf. interval]
age_mean .9682508 .0126326 -2.47 0.013 .9438052 .9933295
needle
Yes 1.276222 .2270302 1.37 0.170 .9005422 1.808625
inject
Yes 1.245357 .2393768 1.14 0.254 .8544367 1.815131
jail
Yes 1.57314 .3490687 2.04 0.041 1.018337 2.430205
Note: Standard error estimates may be more variable for small datasets and datasets with low proportions of interval-censored observations.

Our conclusion here is similar to the case without stratification.

Checking proportional hazards assumption

It is important to evaluate the validity of the underlying assumption for the Cox proportional hazards model, which is that the hazard ratio is constant over time. One way of testing the proportional-hazards assumption for a covariate is to test whether the coefficient associated with that covariate is time invariant. This can be accomplished by including an interaction between this covariate and a function of time in the model and testing whether the corresponding coefficient equals zero.

For example, to check the proportional-hazards assumption for inject, we can include an interaction term between inject and time using the tvc() option:

. stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime)
     tvc(i.inject) nohr
note: using adaptive step size to compute derivatives.

Performing EM optimization (showing every 100 iterations):
Iteration 0:   Log likelihood = -1086.2564
Iteration 100: Log likelihood = -597.58166
Iteration 200: Log likelihood = -597.50007
Iteration 295: Log likelihood = -597.48895

Computing standard errors: ......................................... done

Interval-censored Cox regression                    Number of obs     =  1,124
Baseline hazard: Reduced intervals                         Uncensored =      0
                                                        Left-censored =     41
Event-time interval:                                   Right-censored =    991
  Lower endpoint: ltime                                Interval-cens. =     92
  Upper endpoint: rtime
                                                    Wald chi2(6)      =  17.36
Log likelihood = -597.48895                         Prob > chi2       = 0.0080

OPG
Coefficient std. err. z P>|z| [95% conf. interval]
main
age_mean -.0321889 .0131026 -2.46 0.014 -.0578696 -.0065082
male
Yes -.3817124 .2714232 -1.41 0.160 -.9136921 .1502674
needle
Yes .2434583 .1791214 1.36 0.174 -.1076131 .5945298
inject
Yes .3194498 .3209978 1.00 0.320 -.3096943 .9485939
jail
Yes .4498814 .2230002 2.02 0.044 .012809 .8869538
tvc
inject
Yes -.0080255 .0208446 -0.39 0.700 -.0488801 .0328292
Notes: Standard error estimates may be more variable for small datasets and datasets with low proportions of interval-censored observations. Variables in tvc equation interacted with _t. Wald test that [tvc] = 0: chi2(1) = 0.1482 Prob > chi2 = 0.7002

With the nohr option, the first equation, main, reports the coefficients for the covariates that do not vary over time; the second equation, tvc, reports the estimated coefficients for covariates interacted with time. There is no evidence that the coefficient of inject interacted with time is different from 0. Hence the proprtional-hazards assumption appears to hold for covariate inject.

For single-record interval-censored data, Stata also provides two graphical commands to visually assess the proportional-hazards assumption. These two commands can be used without running stintcox first.

For a single categorical covariate in a Cox model, you can use stintphplot, which plots -ln{-ln(survival)} curves for each category versus ln(analysis time). If the plots are parallel, the proportional-hazards assumption has not been violated. Alternatively, you can use stintcoxnp to plot the nonparametric maximum-likelihood estimation survival curve versus the Cox predicted survival curve for each category. When the two curves are close together, the proportional-hazards assumption is valid.

When a Cox model contains multiple covariates, as mentioned in the above example, only stintphplot is appropriate for graphically checking the proportional-hazards assumption. In that case, we need to use the adjustfor() option.

For example, to graphically check the proportional-hazards assumption for inject, we include all covariates except inject in the adjustfor() option:

. stintphplot, interval(ltime rtime) by(inject) adjustfor(age_mean i.male i.needle i.jail)

Fitting Cox model with covariates from option adjustfor()
for inject = No ...

Fitting Cox model with covariates from option adjustfor()
for inject = Yes ...

A separate Cox model, which contains all covariates from the adjustfor() option, is fit for each level of inject. And the two plots are almost parallel, which indicates that the proportional-hazards assumption has not been violated for the categorical variable inject.

Fitting a model for multiple-record-per-subject interval-censored data

Interval-censored data can also be recorded as multiple-record-per-subject format. In this format, each subject may contain multiple records with multiple examination times, the event status at each examination time, and potential time-varying covariates at each examination time. Here we use an extended version of previous single-record-per-subject dataset, which contains all the baseline covariates as well as a time-varying imprisonment indicator variable (jail_vary) that indicates whether the subject has been imprisioned since the last clinic visit. It also records a subject identifier (id), the examination time of the blood test (time), and whether the blood test is positive at the examination time (is_seropos). Here is a subset of this dataset for subject 271-274:

. use https://www.stata-press.com/data/r18/idu2
(Modified Bangkok IDU Preparatory Study with time-varying variable jail_vary)

. format time age_mean %6.2f

. list id time is_seropos age_mean male needle inject jail_vary
     if id >=271 & id <=274, sepby(id) noobs abbreviate(10)

id time is_seropos age_mean male needle inject jail_vary
271 4.89 No -6.46 Yes Yes No No
271 9.31 No -6.46 Yes Yes No No
271 13.38 No -6.46 Yes Yes No Yes
271 17.97 No -6.46 Yes Yes No Yes
271 22.00 No -6.46 Yes Yes No No
272 3.80 No 8.54 No No No Yes
272 9.41 Yes 8.54 No No No No
273 3.93 No -11.46 Yes Yes No No
273 8.00 No -11.46 Yes Yes No No
273 12.07 No -11.46 Yes Yes No Yes
273 15.97 No -11.46 Yes Yes No Yes
273 20.66 No -11.46 Yes Yes No Yes
274 3.87 Yes -4.46 Yes Yes Yes Yes

We use stintcox to fit a Cox proportional-hazards model in which the time to HIV infection depends on baseline covariates age_mean, male, needle, inject, and time-varying covariate jail_vary. In a multiple-record-per-subject format, we must specify options id(), time(), and status() with stintcox.

. stintcox age_mean i.male i.needle i.inject i.jail_vary, id(id) time(time)
     status(is_seropos)
note: time-varying covariates detected in the data; using method nearleft to
      impute their values between examination times.
note: using adaptive step size to compute derivatives.

Performing EM optimization (showing every 100 iterations):
Iteration 0:   Log likelihood = -1086.2564
Iteration 100: Log likelihood = -598.45375
Iteration 200: Log likelihood = -598.35872
Iteration 285: Log likelihood = -598.34887

Computing standard errors: ........................................ done

Interval-censored Cox regression                   Number of obs      =  6,453
Baseline hazard: Reduced intervals                 Number of subjects =  1,124
                                                           Uncensored =      0
ID variable: id                                         Left-censored =     41
Examination time: time                                 Right-censored =    991
Status indicator: is_seropos                           Interval-cens. =     92

                                                   Wald chi2(5)       =  17.03
Log likelihood = -598.34887                        Prob > chi2        = 0.0044

OPG
time Haz. ratio std. err. z P>|z| [95% conf. interval]
age_mean .9714605 .012757 -2.20 0.027 .9467762 .9967884
male
Yes .6678044 .1816576 -1.48 0.138 .3918353 1.138138
needle
Yes 1.271409 .2275426 1.34 0.180 .8952546 1.805609
inject
Yes 1.370672 .2575405 1.68 0.093 .9484142 1.980928
jail_vary
Yes 1.440966 .2916178 1.81 0.071 .9691488 2.142481
Time varying: jail_vary Note: Standard error estimates may be more variable for small datasets and datasets with low proportions of interval-censored observations.

Compared with the previous example, after we account for time-varying imprisonment, the hazard ratio for inject increases from 1.25 to 1.37, but the effect of imprisonment decreases from 1.57 for baseline jail to 1.44 for time-varying jail_vary.

Graphing survivor functions

After fitting the model, we can use stcurve to plot the estimated survivor, failure, hazard, or cumulative hazard function. By default, stcurve evaluates the functions at the overall means of covariates.

. stcurve, survival
note: function evaluated at overall means of covariates.

With multiple-record data with time-varying covariates, we may want to incorporate the time-varying nature of those covariates when plotting the functions. In this case, we can use the attmeans option to evaluate the function at time-specific means.

. stcurve, survival attmeans
note: function evaluated at time-specific means of covariates.

We can also use the atframe() option to specify our own time-varying covariate values to be used to evaluate the survivor function. Suppose we want to plot the survivor curve for an individual with the same covariate pattern as subject 1 in our dataset. We can create a new frame called id1 and use frame put to copy the relevant information for subject 1 in this new frame. Then we list the data we just saved in frame id1.

. frame put time age_mean male needle inject jail_vary
     if id==1, into(id1)

. frame id1: list

time age_mean male needle inject jail_v~y
1. 4.13 5.54 Yes No Yes Yes
2. 8.26 5.54 Yes No Yes No
3. 12.33 5.54 Yes No Yes No
4. 16.10 5.54 Yes No Yes No
5. 20.10 5.54 Yes No Yes No
6. 24.23 5.54 Yes No Yes No
7. 28.62 5.54 Yes No Yes No
8. 32.59 5.54 Yes No Yes No
9. 36.20 5.54 Yes No Yes No
10. 40.56 5.54 Yes No Yes Yes

We can graph the survivor curve for this particular profile by typing

. stcurve, survival atframe(id1)
note: function evaluated at specified values of selected covariates and overall
      means of other covariates (if any).
note: covariate values from frame id1 used to evaluate function.

References

Zeng, D., L. Mao, and D. Lin. 2016. Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika 103: 253–271.S

Zeng, D., F. Gao, and D. Lin. 2017. Maximum likelihood estimation for semiparametric regression models with multivariate interval-censored data. Biometrika 104: 505–525.

Turnbull, B. W. 1976. The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society, Series B 38: 290–295.

Farrington, C. P. 2000. Residuals for proportional hazards models with interval-censored survival data. Biometrics 56: 473–482.

Tell me more

To learn more about how to fit parametric models to interval-censored data using the stintreg command, see [ST] stintreg.