TVCs with interval-censored Cox model

Order

Watch video demo

<- See Stata 18's new features

Highlights

Single- and multiple-record data formats
Time-varying covariates (TVCs):
- Created automatically as deterministic functions of time in both formats
- Supplied directly in a multiple-record data format
- Use to test the proportional-hazards assumption
Predictions with TVCs
Plots of functions with TVCs: survivor, cumulative hazard, and more

Stata 17 introduced the stintcox command to fit genuine semiparametric Cox models to interval-censored event-time data. Stata 18 adds support for time-varying covariates (TVCs), commonly used in practice to record characteristics that change during follow-up. So, whether you have a biomarker variable or a policy variable that changes with time, you can now include them to model events of interest known only to lie within some time interval. For instance, the event may be a recurrence of cancer or a change in employment status recorded during one of the follow-ups. Both are examples of interval-censored event-time data, in which the event time is not observed exactly.

Use stintcox to create TVCs automatically by interacting existing covariates with specified deterministic functions of time or specify your own TVCs in the new multiple-record-per-subject data format. Use TVCs to test the proportional-hazards assumption with the Wald test or likelihood-ratio test. And incorporate TVCs in your predictions and plots of survivor and other functions.

Let's see it work

-> Testing the proportional-hazards assumption using TVCs

-> Fitting an interval-censored Cox model using TVCs in multiple-record data

-> Graphing survivor functions

Testing the proportional-hazards assumption using TVCs

Here we demonstrate how to use TVCs to test the proportional-hazards assumption of a Cox model. Following Zeng et al. (2016), we use an interval-censored Cox regression to model time to HIV infection of injecting drug users. All 1,124 subjects initially tested negative for HIV. They were followed and assessed for HIV-1 seropositivity through blood tests approximately every four months. Because the subjects were tested periodically, the exact time of HIV infection was not observed, but the times were known to fall in intervals between blood tests. The corresponding lower and upper time endpoints are recorded in variables ltime and rtime.

The baseline factors we use to model the HIV-1 seropositivity are centered age at recruitment (age_mean), sex (male), history of needle sharing (needle), history of drug injection (inject), and whether a subject has been in jail at the time of recruitment (jail). Here is a subset of the dataset for subjects 271 through 274 with one record per subject:

. use https://www.stata-press.com/data/r18/idu
(Modified Bangkok IDU Preparatory Study)

. format ltime rtime age_mean %6.2f

. list ltime rtime age_mean male needle inject jail if _n >= 271 & _n <= 274



       ltime   rtime   age_mean   male   needle   inject   jail 

 271.  22.00       .      -6.46    Yes      Yes       No     No 
 272.   3.80    9.41       8.54     No       No       No    Yes 
 273.  20.66       .     -11.46    Yes      Yes       No     No 
 274.   0.00    3.87      -4.46    Yes      Yes      Yes    Yes

We fit a Cox proportional hazards model in which time to HIV infection depends on the above baseline factors. In a single-record-per-subject format, we specify the lower and upper event-time intervals in stintcox's interval() option.

. stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime)
note: using adaptive step size to compute derivatives.

Performing EM optimization (showing every 100 iterations):
Iteration 0:    Log likelihood = -1086.2564
Iteration 100:  Log likelihood = -597.65634
Iteration 200:  Log likelihood = -597.57555
Iteration 295:  Log likelihood = -597.56443

Computing standard errors: ............................ done

Interval-censored Cox regression                    Number of obs     =  1,124
Baseline hazard: Reduced intervals                         Uncensored =      0
                                                        Left-censored =     41
                                                       Right-censored =    991
                                                       Interval-cens. =     92

                                                    Wald chi2(5)      =  17.10
Log likelihood = -597.56443                         Prob > chi2       = 0.0043



                              OPG                                             
               Haz. ratio   std. err.      z    P>|z|     [95% conf. interval]

    age_mean     .9684341   .0126552    -2.45   0.014     .9439452    .9935582
              
        male  
        Yes      .6846949   .1855907    -1.40   0.162     .4025073    1.164717
              
      needle  
        Yes      1.275912   .2279038     1.36   0.173     .8990401    1.810768
              
      inject  
        Yes      1.250154   .2414221     1.16   0.248     .8562184    1.825334
              
        jail  
        Yes      1.567244   .3473972     2.03   0.043     1.014982    2.419998

Note: Standard-error estimates may be more variable for small datasets and
      datasets with low proportions of interval-censored observations.

We find that age and being in jail at enrollment appear to be associated with, respectively, lower and higher risks of HIV infection.

One way of testing the proportional-hazards assumption for a covariate is to test whether the coefficient associated with that covariate is time invariant. This can be accomplished by including an interaction between this covariate and a function of time in the model and testing whether the corresponding coefficient equals zero.

To test the proportional-hazards assumption in our example, we include all covariates in option tvc() to additionally include their interactions with the analysis time _t, the default, in the model. In this analysis, we suppress the default reporting of hazard ratios with option nohr:

. stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime)
     tvc(age_mean i.male i.needle i.inject i.jail) nohr
note: using adaptive step size to compute derivatives.

Performing EM optimization (showing every 100 iterations):
Iteration 0:   Log likelihood = -1086.2564
Iteration 100: Log likelihood = -590.53655
Iteration 200: Log likelihood = -590.45163
Iteration 300: Log likelihood = -590.43665
Iteration 340: Log likelihood = -590.43386

Computing standard errors: .............................................. done


Interval-censored Cox regression                    Number of obs     =  1,124
Baseline hazard: Reduced intervals                         Uncensored =      0
                                                        Left-censored =     41
Event-time interval:                                   Right-censored =    991
  Lower endpoint: ltime                                Interval-cens. =     92
  Upper endpoint: rtime
                                                    Wald chi2(10)     =  31.99
Log likelihood = -590.43386                         Prob > chi2       = 0.0004



                              OPG                                             
               Haz. ratio   std. err.      z    P>|z|     [95% conf. interval]

main                                                                          
    age_mean    -.0310177   .0233817    -1.33   0.185     -.076845    .0148097
              
        male  
        Yes     -1.271583   .4604788    -2.76   0.006    -2.174105   -.3690615
              
      needle  
        Yes     -.1819587   .3297493    -0.55   0.581    -.8282554     .464338
              
      inject  
        Yes      .6852961   .3431924     2.00   0.046     .0126513    1.357941
              
        jail  
        Yes      -.529615   .4021087    -1.32   0.188    -1.317734    .2585036

tvc                                                                           
    age_mean     -.000129   .0017099    -0.08   0.940    -.0034804    .0032224
              
        male  
        Yes      .0884102    .042994     2.06   0.040     .0041434    .1726769
              
      needle  
        Yes      .0358545   .0238562     1.50   0.133    -.0109027    .0826118
              
      inject  
        Yes     -.0361192   .0228754    -1.58   0.114    -.0809541    .0087157
              
        jail  
        Yes      .0916036   .0348915     2.63   0.009     .0232176    .1599896

Notes: Standard error estimates may be more variable for small datasets and
       datasets with low proportions of interval-censored observations.
       Variables in tvc equation interacted with _t.

Wald test that [tvc] = 0: chi2(5) = 13.3282               Prob > chi2 = 0.0205

The first equation, main, reports the coefficients for the covariates that do not vary over time; the second equation, tvc, reports the results for covariates interacted with time. And the proprtional-hazards assumption appears to be violated for covariates male and jail according to their p-values reported in the tvc equation. The Wald test at the bottom of the table also indicates that the proportional-hazards assumption doesn't hold globally.

Fitting an interval-censored Cox model using TVCs in multiple-record data

In the previous example, we used stintcox's tvc() option to create TVCs to test the proportional-hazards assumption. In some applications, TVCs already exist in the dataset. Continuing with the data on injecting drug users, being in jail is a TVC. TVCs can be recorded only in multiple-record-per-subject data format. In this format, each subject may contain multiple records with multiple examination times, the event status at each examination time, and potential time-varying covariates at each examination time.

Let's use an extended version, idu2.dta, of the previous idu.dta dataset, which contains all the baseline covariates as well as a time-varying imprisonment indicator variable (jail_vary). jail_vary indicates whether the subject has been imprisioned since the last clinic visit. The dataset also records a subject identifier (id), the examination time of the blood test (time), and whether the blood test is positive at the examination time (is_seropos). Here is a subset of this dataset for subjects 271 through 274 with multiple records per subject:

. use https://www.stata-press.com/data/r18/idu2
(Modified Bangkok IDU Preparatory Study with time-varying variable jail_vary)

. format time age_mean %6.2f

. list id time is_seropos age_mean male needle inject jail_vary
     if id >= 271 & id <= 274, sepby(id) noobs abbreviate(10) 


    
      id    time   is_seropos   age_mean   male   needle   inject   jail_vary 
    
     271    4.89           No      -6.46    Yes      Yes       No          No 
     271    9.31           No      -6.46    Yes      Yes       No          No 
     271   13.38           No      -6.46    Yes      Yes       No         Yes 
     271   17.97           No      -6.46    Yes      Yes       No         Yes 
     271   22.00           No      -6.46    Yes      Yes       No          No 
    
     272    3.80           No       8.54     No       No       No         Yes 
     272    9.41          Yes       8.54     No       No       No          No 
    
     273    3.93           No     -11.46    Yes      Yes       No          No 
     273    8.00           No     -11.46    Yes      Yes       No          No 
     273   12.07           No     -11.46    Yes      Yes       No         Yes 
     273   15.97           No     -11.46    Yes      Yes       No         Yes 
     273   20.66           No     -11.46    Yes      Yes       No         Yes 
    
     274    3.87          Yes      -4.46    Yes      Yes      Yes         Yes

We refit the previous Cox model, but we now use the time-varying variable jail_vary instead of the baseline variable jail. In a multiple-record-per-subject format, we specify the subject identifier in option id(), the examination time in option time(), and the event-indicator status in option status().

. stintcox age_mean i.male i.needle i.inject i.jail_vary, id(id) time(time)   
     status(is_seropos)
note: time-varying covariates detected in the data; using method nearleft to
      impute their values between examination times.
note: using adaptive step size to compute derivatives.

Performing EM optimization (showing every 100 iterations):
Iteration 0:   Log likelihood = -1086.2564
Iteration 100: Log likelihood = -598.45375
Iteration 200: Log likelihood = -598.35872
Iteration 285: Log likelihood = -598.34887

Computing standard errors: ........................................ done

Interval-censored Cox regression                   Number of obs      =  6,453
Baseline hazard: Reduced intervals                 Number of subjects =  1,124
                                                           Uncensored =      0
ID variable: id                                         Left-censored =     41
Examination time: time                                 Right-censored =    991
Status indicator: is_seropos                           Interval-cens. =     92

                                                   Wald chi2(5)       =  17.03
Log likelihood = -598.34887                        Prob > chi2        = 0.0044



                              OPG                                             
        time   Haz. ratio   std. err.      z    P>|z|     [95% conf. interval]

    age_mean     .9714605    .012757    -2.20   0.027     .9467762    .9967884
              
        male  
        Yes      .6678044   .1816576    -1.48   0.138     .3918353    1.138138
              
      needle  
        Yes      1.271409   .2275426     1.34   0.180     .8952546    1.805609
              
      inject  
        Yes      1.370672   .2575405     1.68   0.093     .9484142    1.980928
              
   jail_vary  
        Yes      1.440966   .2916178     1.81   0.071     .9691488    2.142481

Time varying: jail_vary
Note: Standard error estimates may be more variable for small datasets and
      datasets with low proportions of interval-censored observations.

Compared with the previous example, after we account for time-varying imprisonment, the hazard ratio for inject increases from 1.25 to 1.37, but the effect of imprisonment decreases from 1.57 for baseline jail to 1.44 for time-varying jail_vary.

Graphing survivor functions

After fitting the model, we can use stcurve to plot the estimated survivor, failure, hazard, or cumulative hazard function. By default, stcurve evaluates the functions at the overall means of covariates.

. stcurve, survival
note: function evaluated at overall means of covariates.

With multiple-record data with TVCs, we may want to incorporate the time-varying nature of those covariates when plotting the functions. In this case, we can use option attmeans to evaluate the function at time-specific means.

. stcurve, survival attmeans
note: function evaluated at time-specific means of covariates.

We can also use option atframe() to specify our own TVC values to be used to evaluate the survivor function. Suppose we want to plot the survivor curve for an individual with the same covariate pattern as subject 1 in our dataset. We create a new frame called id1 and use frame put to copy the relevant information for subject 1 to this new frame. We list the data we just saved in frame id1.

. frame put time age_mean male needle inject jail_vary if id==1, into(id1)

. frame id1: list



        time   age_mean   male   needle   inject   jail_v~y 

  1.   4.13       5.54    Yes       No      Yes        Yes 
  2.   8.26       5.54    Yes       No      Yes         No 
  3.  12.33       5.54    Yes       No      Yes         No 
  4.  16.10       5.54    Yes       No      Yes         No 
  5.  20.10       5.54    Yes       No      Yes         No 

  6.  24.23       5.54    Yes       No      Yes         No 
  7.  28.62       5.54    Yes       No      Yes         No 
  8.  32.59       5.54    Yes       No      Yes         No 
  9.  36.20       5.54    Yes       No      Yes         No 
 10.  40.56       5.54    Yes       No      Yes        Yes

Then we can graph the survivor curve for this particular profile by typing

. stcurve, survival atframe(id1)
note: function evaluated at specified values of selected covariates and overall
      means of other covariates (if any).
note: covariate values from frame id1 used to evaluate function.

Reference

Zeng, D., L. Mao, and D. Lin. 2016. Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika 103: 253—271.

Tell me more

Learn more about other new features in survival analysis.

View all the new features in Stata 18.

Made for data science.

Get started today.

Order

Upgrade