Genuine semiparametric Cox proportional hazards modeling

Left-censoring, right-censoring, interval-censoring

Single-record and multiple-record data format

Acommodation with time-varying covariates

Stratified estimation

Four methods to estimate standard errors, including robust and cluster–robust standard errors

Testing proportional-hazards assumption

Predictions of baseline functions, Martingale-like residuals, Cox–Snell-like residuals, and time-varying prediction

Graphs of survivor, cumulative hazard, and hazard functions

Graphical checks of proportional-hazards assumption

What is the exact time of cancer recurrence? Or time of COVID-19 infection? We don't really know exactly. All we know is that the cancer recurred sometime between two follow-up examinations and that the infection occurred sometime before the onset of symptoms or a positive COVID test. Data like these are called interval-censored time-to-event data. By time-to-event or event-time data, we also mean failure-time data, duration data, and so on. They can be stored in two different formats: single-record-per-subject format and multiple-record-per-subject format.

Interval-censored event-time data arise in many areas, including medical, epidemiological, economic, financial, and sociological studies. Ignoring interval-censoring will often lead to biased estimates.

A semiparametric Cox proportional hazards regression model is used routinely to analyze uncensored and right-censored event-time data.
In Stata, you can use the estimation command **stintcox** to fit the Cox model to interval-censored event-time data,
for both single-record and multiple-record formats.

Just as with right-censored data, a Cox model is appealing for interval-censored data because it uses semiparametric estimation and thus does not require any parametric assumptions about the baseline hazard function. Also for low-event rates, the exponentiated regression parameters approximate the log relative-risks.

Semiparametric estimation of interval-censored event-time data is challenging because none of the event times are observed exactly. Thus, "semiparametric" modeling of these data often resorted to using spline methods or piecewise-exponential models for the baseline hazard function. Genuine semiparametric modeling of interval-censored event-time data was not available until recent methodological advances, which are implemented in the **stintcox** command.

Fitting a model for single-record-per-subject interval-censored data

Fitting a model for multiple-record-per-subject interval-censored data

We use data from Zeng et al. (2016) that studies
time to HIV infection in a cohort study of injecting drug users in Thailand.
The dataset contains 1,124 subjects. Those subjects initially tested
negative for the HIV virus. They were then followed and assessed for HIV-1
seropositivity through blood tests approximately every four months.
Because the subjects were tested periodically, the exact time of HIV
infection was not observed, but the times are known to fall in intervals between
blood tests with the lower and upper endpoints recorded in variables
**ltime** and **rtime**.

The dataset also records several baseline factors, such as centered age at
recruitment (**age_mean**), sex (**male**), history of needle sharing
(**needle**), history of drug injection (**inject**), and whether a
subject has been in jail at the time of recruitment (**jail**). The dataset
contains one record per subject with the event-time information recorded as
interval data and is called single-record-per-subject data. Here is a subset
of the dataset for subjects 271–274:

.use https://www.stata-press.com/data/r18/idu(Modified Bangkok IDU Preparatory Study) .format ltime rtime age_mean %6.2f.list ltime rtime age_mean male needle inject jail if _n >=271 & _n<=274

ltime rtime age_mean male needle inject jail | |||

271. | 22.00 . -6.46 Yes Yes No No | ||

272. | 3.80 9.41 8.54 No No No Yes | ||

273. | 20.66 . -11.46 Yes Yes No No | ||

274. | 0.00 3.87 -4.46 Yes Yes Yes Yes | ||

We can fit a Cox proportional hazards model in which time to HIV infection
depends on above factors of interest. In a single-record-per-subject format,
we must specify option **interval()** with **stintcox**.

.stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime)note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1086.2564 Iteration 100: Log likelihood = -597.65634 Iteration 200: Log likelihood = -597.57555 Iteration 295: Log likelihood = -597.56443 Computing standard errors: ............................ done Interval-censored Cox regression Number of obs = 1,124 Baseline hazard: Reduced intervals Uncensored = 0 Left-censored = 41 Right-censored = 991 Interval-cens. = 92 Wald chi2(5) = 17.10 Log likelihood = -597.56443 Prob > chi2 = 0.0043

OPG | ||

Haz. ratio std. err. z P>|z| [95% conf. interval] | ||

age_mean | .9684341 .0126552 -2.45 0.014 .9439452 .9935582 | |

male | ||

Yes | .6846949 .1855907 -1.40 0.162 .4025073 1.164717 | |

needle | ||

Yes | 1.275912 .2279038 1.36 0.173 .8990401 1.810768 | |

inject | ||

Yes | 1.250154 .2414221 1.16 0.248 .8562184 1.825334 | |

jail | ||

Yes | 1.567244 .3473972 2.03 0.043 1.014982 2.419998 | |

We find that age at the recruitment is significantly associated with lower risk of HIV infection and being in jail at enrollment is associated with higher risk of HIV infection. Other factors are not statistically significant.

If we assume that the baseline hazard function for female is different
from that for male, we can fit a stratified Cox proportional hazards
model by using the **strata()** option:

.stintcox age_mean i.needle i.inject i.jail, interval(ltime rtime) strata(male)note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1087.0536 Iteration 100: Log likelihood = -585.59848 Iteration 200: Log likelihood = -585.53143 Iteration 282: Log likelihood = -585.5222 Computing standard errors: ........................... done Stratified interval-censored Cox regression Number of obs = 1,124 Baseline hazard: Reduced intervals Uncensored = 0 Strata variable: male Left-censored = 41 Right-censored = 991 Event-time interval: Interval-cens. = 92 Lower endpoint: ltime Upper endpoint: rtime Wald chi2(4) = 14.84 Log likelihood = -585.5222 Prob > chi2 = 0.0051

OPG | ||

Haz. ratio std. err. z P>|z| [95% conf. interval] | ||

age_mean | .9682508 .0126326 -2.47 0.013 .9438052 .9933295 | |

needle | ||

Yes | 1.276222 .2270302 1.37 0.170 .9005422 1.808625 | |

inject | ||

Yes | 1.245357 .2393768 1.14 0.254 .8544367 1.815131 | |

jail | ||

Yes | 1.57314 .3490687 2.04 0.041 1.018337 2.430205 | |

Our conclusion here is similar to the case without stratification.

It is important to evaluate the validity of the underlying assumption for the Cox proportional hazards model, which is that the hazard ratio is constant over time. One way of testing the proportional-hazards assumption for a covariate is to test whether the coefficient associated with that covariate is time invariant. This can be accomplished by including an interaction between this covariate and a function of time in the model and testing whether the corresponding coefficient equals zero.

For example, to check the proportional-hazards assumption for **inject**,
we can include an interaction term between **inject** and time using the
**tvc()** option:

.stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime) tvc(i.inject) nohrnote: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1086.2564 Iteration 100: Log likelihood = -597.58166 Iteration 200: Log likelihood = -597.50007 Iteration 295: Log likelihood = -597.48895 Computing standard errors: ......................................... done Interval-censored Cox regression Number of obs = 1,124 Baseline hazard: Reduced intervals Uncensored = 0 Left-censored = 41 Event-time interval: Right-censored = 991 Lower endpoint: ltime Interval-cens. = 92 Upper endpoint: rtime Wald chi2(6) = 17.36 Log likelihood = -597.48895 Prob > chi2 = 0.0080

OPG | ||

Coefficient std. err. z P>|z| [95% conf. interval] | ||

main | ||

age_mean | -.0321889 .0131026 -2.46 0.014 -.0578696 -.0065082 | |

male | ||

Yes | -.3817124 .2714232 -1.41 0.160 -.9136921 .1502674 | |

needle | ||

Yes | .2434583 .1791214 1.36 0.174 -.1076131 .5945298 | |

inject | ||

Yes | .3194498 .3209978 1.00 0.320 -.3096943 .9485939 | |

jail | ||

Yes | .4498814 .2230002 2.02 0.044 .012809 .8869538 | |

tvc | ||

inject | ||

Yes | -.0080255 .0208446 -0.39 0.700 -.0488801 .0328292 | |

With the **nohr** option, the first equation, **main**, reports
the coefficients for the covariates that do not vary over time; the
second equation, **tvc**, reports the estimated coefficients for
covariates interacted with time. There is no evidence that the coefficient of
**inject** interacted with time is different from 0. Hence the
proprtional-hazards assumption appears to hold for covariate
**inject**.

For single-record interval-censored data, Stata also provides two graphical
commands to visually assess the proportional-hazards assumption. These two
commands can be used without running **stintcox** first.

For a single categorical covariate in a Cox model, you can use
**stintphplot**, which plots -ln{-ln(survival)} curves for each category
versus ln(analysis time). If the plots are parallel, the
proportional-hazards assumption has not been violated. Alternatively,
you can use **stintcoxnp** to plot the nonparametric maximum-likelihood
estimation survival curve versus the Cox predicted survival curve for
each category. When the two curves are close together, the
proportional-hazards assumption is valid.

When a Cox model contains multiple covariates, as mentioned in the above
example, only **stintphplot** is appropriate for graphically checking the
proportional-hazards assumption. In that case, we need to use the
**adjustfor()** option.

For example, to graphically check the proportional-hazards assumption
for **inject**, we include all covariates except **inject** in the
**adjustfor()** option:

.stintphplot, interval(ltime rtime) by(inject) adjustfor(age_mean i.male i.needle i.jail)Fitting Cox model with covariates from option adjustfor() for inject = No ... Fitting Cox model with covariates from option adjustfor() for inject = Yes ...

A separate Cox model, which contains all covariates from the
**adjustfor()** option, is fit for each level of **inject**. And
the two plots are almost parallel, which indicates that the
proportional-hazards assumption has not been violated for the
categorical variable **inject**.

Interval-censored data can also be recorded as
multiple-record-per-subject format. In this format, each subject may
contain multiple records with multiple examination times, the event
status at each examination time, and potential time-varying covariates
at each examination time. Here we use an extended version of previous
single-record-per-subject dataset, which contains all the baseline
covariates as well as a time-varying imprisonment indicator variable
(**jail_vary**) that indicates whether the subject has been
imprisioned since the last clinic visit. It also records a subject
identifier (**id**), the examination time of the blood test
(**time**), and whether the blood test is positive at the examination
time (**is_seropos**). Here is a subset of this dataset for subject
271-274:

.use https://www.stata-press.com/data/r18/idu2(Modified Bangkok IDU Preparatory Study with time-varying variable jail_vary) .format time age_mean %6.2f.list id time is_seropos age_mean male needle inject jail_vary if id >=271 & id <=274, sepby(id) noobs abbreviate(10)

id time is_seropos age_mean male needle inject jail_vary | ||

271 4.89 No -6.46 Yes Yes No No | ||

271 9.31 No -6.46 Yes Yes No No | ||

271 13.38 No -6.46 Yes Yes No Yes | ||

271 17.97 No -6.46 Yes Yes No Yes | ||

271 22.00 No -6.46 Yes Yes No No | ||

272 3.80 No 8.54 No No No Yes | ||

272 9.41 Yes 8.54 No No No No | ||

273 3.93 No -11.46 Yes Yes No No | ||

273 8.00 No -11.46 Yes Yes No No | ||

273 12.07 No -11.46 Yes Yes No Yes | ||

273 15.97 No -11.46 Yes Yes No Yes | ||

273 20.66 No -11.46 Yes Yes No Yes | ||

274 3.87 Yes -4.46 Yes Yes Yes Yes | ||

We use **stintcox** to fit a Cox proportional-hazards model in which the
time to HIV infection depends on baseline covariates **age_mean**,
**male**, **needle**, **inject**, and time-varying covariate
**jail_vary**. In a multiple-record-per-subject format, we must specify
options **id()**, **time()**, and **status()** with **stintcox**.

.stintcox age_mean i.male i.needle i.inject i.jail_vary, id(id) time(time) status(is_seropos)note: time-varying covariates detected in the data; using method nearleft to impute their values between examination times. note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1086.2564 Iteration 100: Log likelihood = -598.45375 Iteration 200: Log likelihood = -598.35872 Iteration 285: Log likelihood = -598.34887 Computing standard errors: ........................................ done Interval-censored Cox regression Number of obs = 6,453 Baseline hazard: Reduced intervals Number of subjects = 1,124 Uncensored = 0 ID variable: id Left-censored = 41 Examination time: time Right-censored = 991 Status indicator: is_seropos Interval-cens. = 92 Wald chi2(5) = 17.03 Log likelihood = -598.34887 Prob > chi2 = 0.0044

OPG | ||

time | Haz. ratio std. err. z P>|z| [95% conf. interval] | |

age_mean | .9714605 .012757 -2.20 0.027 .9467762 .9967884 | |

male | ||

Yes | .6678044 .1816576 -1.48 0.138 .3918353 1.138138 | |

needle | ||

Yes | 1.271409 .2275426 1.34 0.180 .8952546 1.805609 | |

inject | ||

Yes | 1.370672 .2575405 1.68 0.093 .9484142 1.980928 | |

jail_vary | ||

Yes | 1.440966 .2916178 1.81 0.071 .9691488 2.142481 | |

Compared with the previous example, after we account for time-varying
imprisonment, the hazard ratio for **inject** increases from 1.25 to 1.37,
but the effect of imprisonment decreases from 1.57 for baseline **jail**
to 1.44 for time-varying **jail_vary**.

After fitting the model, we can use **stcurve** to plot the estimated
survivor, failure, hazard, or cumulative hazard function. By default,
**stcurve** evaluates the functions at the overall means of covariates.

.stcurve, survivalnote: function evaluated at overall means of covariates.

With multiple-record data with time-varying covariates, we may want to
incorporate the time-varying nature of those covariates when plotting the
functions. In this case, we can use the **attmeans** option to evaluate
the function at time-specific means.

.stcurve, survival attmeansnote: function evaluated at time-specific means of covariates.

We can also use the **atframe()** option to specify our own time-varying
covariate values to be used to evaluate the survivor function. Suppose we
want to plot the survivor curve for an individual with the same covariate
pattern as subject 1 in our dataset. We can create a new frame called
**id1** and use **frame put** to copy the relevant information for
subject 1 in this new frame. Then we **list** the data we just saved in
frame **id1**.

.frame put time age_mean male needle inject jail_vary if id==1, into(id1).frame id1: list

time age_mean male needle inject jail_v~y | |||

1. | 4.13 5.54 Yes No Yes Yes | ||

2. | 8.26 5.54 Yes No Yes No | ||

3. | 12.33 5.54 Yes No Yes No | ||

4. | 16.10 5.54 Yes No Yes No | ||

5. | 20.10 5.54 Yes No Yes No | ||

6. | 24.23 5.54 Yes No Yes No | ||

7. | 28.62 5.54 Yes No Yes No | ||

8. | 32.59 5.54 Yes No Yes No | ||

9. | 36.20 5.54 Yes No Yes No | ||

10. | 40.56 5.54 Yes No Yes Yes | ||

We can graph the survivor curve for this particular profile by typing

.stcurve, survival atframe(id1)note: function evaluated at specified values of selected covariates and overall means of other covariates (if any). note: covariate values from frame id1 used to evaluate function.

Zeng, D., L. Mao, and D. Lin. 2016. Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika 103: 253–271.S

Zeng, D., F. Gao, and D. Lin. 2017. Maximum likelihood estimation for semiparametric regression models with multivariate interval-censored data. Biometrika 104: 505–525.

Turnbull, B. W. 1976. The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society, Series B 38: 290–295.

Farrington, C. P. 2000. Residuals for proportional hazards models with interval-censored survival data. Biometrics 56: 473–482.

To learn more about how to fit parametric models to interval-censored data using the **stintreg** command, see **[ST] stintreg**.