Home  /  Resources & support  /  FAQs  /  Stata 6: stcox with time-varying covariate
Note: This FAQ is for users of Stata 6. It is not relevant for more recent versions.

Stata 6: How do I estimate a Cox model with a continuously time-varying parameter?

Title   Stata 6: Estimating a Cox model with a continuously time-varying parameter
Author William Gould, StataCorp

This question was originally posed on Statalist and answered by several users and StataCorp’s Bill Gould. Bill's answer, with slight editing, appears below. The thread began when one Stata user was trying to reproduce a Cox regression example on pages 195–197 of Modelling Survival Data in Medical Research by D. Collett (Chapman & Hall 1991).

Below is the response written by Bill Gould, which we quote in full.


Here are a little bit of data in which we want to investigate a continuously time varying Cox-regression.

  . describe
 
 Contains data
   obs:            26                            
  vars:             5                            
  size:           624 (99.9% of memory free)
 -----------------------------------------------------------------------------
    1. patient     float  %9.0g                  
    2. time        float  %9.0g                  survival time (days)
    3. dead        float  %9.0g                  dead
    4. treat       float  %9.0g                  1=single 2=combined
    5. age         float  %9.0g                  age
 -----------------------------------------------------------------------------
 Sorted by:  
      Note:  data has changed since last save
 
  . list in 1/5
 
        patient       time       dead      treat        age  
   1.         1        156          1          1         66  
   2.         2       1040          0          1         38  
   3.         3         59          1          1         72  
   4.         4        421          0          2         53  
   5.         5        329          1          1         43  

and here is what we are to do:

  1. Create new variable c = treat-1.
  2. Estimate a Cox survival model on c, age, and c*time, where time = (time-470).

Finally, the book reports

h(t) = exp{.136*age - .532*c + .003*c*time} * h0(t)

The problem is that this regression includes the (continously varying) time-varying regressor c*time.

This is indeed a tricky problem for Stata. Stata will estimate time-varying models, but Stata estimates models in which the time-varying regressors are assumed to be constant within intervals. Thus the formal answer to our question is that Stata cannot estimate the model. I can, however, come arbitrarily close. I can certainly reproduce these results.

Let’s back up, and let me show you how to do that.

In Stata, when you want to estimate a regression with time-varying covariates, there are to be multiple observations in the dataset per patient. Let us consider the first patient.

           patient       time       dead      treat        age  
      1.         1        156          1          1         66  

As the dataset is right now, this single observation records all the information on the patient. It says that the patient died on day 156. Now pretend the data on patient 1 looked like this,

           patient       time       dead      treat        age  
      1.         1          1          0          1         66  
      2.         1        156          1          1         66

There are now two observations on this patient. The two observations summarize the experiences of the patient over the time intervals (0,1] and (1,156]. Note that I changed dead=0 in the first observation, so the patient did not die at time 1. As I have written it, these two observations for patient 1 record exactly the same information as the single observation did earlier. So do these three observations record the same information,

           patient       time       dead      treat        age  
      1.         1          1          0          1         66  
      2.         1          2          0          1         66  
      3.         1        156          1          1         66

and so do these 156 observations,

           patient       time       dead      treat        age  
      1.         1          1          0          1         66  
      2.         1          2          0          1         66  
      3.         1          3          0          1         66  
      ...
    155.         1        155          0          1         66
    156.         1        156          1          1         66

Note that in the last form I have one observation per day. I could add a new variable to this dataset equal to c*(time-470). In fact, I have done this in Stata. Here is my do-file:

 clear
 infile using cancer 
 label var time "survival time (days)"
 label var dead "dead"
 label var age "age"
 label var treat "1=single 2=combined"
 
 gen c = treat-1
 compress
 
 expand time
 sort patient
 qui by patient: gen t = _n
 gen outcome = 0 
 quietly by patient: replace outcome = dead if _n==_N
 
 stset t, failure(outcome) id(patient)
 
 gen ctime = c*(t-470)
 stcox age c ctime, nohr

Here is an explanation of what I am doing:

    clear
    infile using cancer 
    label var time "survival time (days)"
    label var dead "dead"
    label var age "age"
    label var treat "1=single 2=combined"

I am just reading and labelling the data I showed you earlier. For those who want to duplicate my results, the dictionary cancer.dct can be found below my signature.

    gen c = treat-1
    compress

I create the c variable as instructed and toss in a compress command to make the dataset occupy as little memory as possible. The compress plays no substantive role, and I could omit it.

    expand time

This is the important statement in the do-file. expand time means, out of each observation, make time observations (such as 156 observations) from it. Said differently, I have asked Stata to make time-1 copies of each observation. Thus, after this statement, my first observation (which had time=156) is now 156 identical observations:

    patient       time       dead      treat        age  
          1          1          0          1         66  
          1          1          0          1         66  
          ...
          1          1          0          1         66  

The observations are not one after the other, however. All the new observations have been added after the original 26 observations.

    sort patient

Now the observations are in patient order.

    quietly by patient: gen t = _n

I create a new time variable called t equal to 1, 2, 3, ..., within patient.

    gen outcome = 0 
    quietly by patient: replace outcome = dead if _n==_N

I create a new outcome variable equal to 0 and then, within patient, set its last observation to the value of the original dead variable.

My data is now just how I want it, and I’m ready to analyze it.

    stset t, failure(outcome) id(patient)

I declare my data to be st data with multiple observations per patient.

    gen ctime = c*(t-470)

I create the c*time variable. I could have done this before the stset. The order does not matter.

    stcox age c ctime, nohr

I estimate the regression. The nohr option means no hazard ratios. I prefer to see results as hazard ratios, but the results to which we are comparing are reported as coefficients, so I've asked Stata to display them that way.

Here is the result of running my do-file:

 . do cancer 
  
 . clear

 . infile using cancer 

 dictionary {
        patient
        time
        dead
        treat
        age
 }
 
 (26 observations read)

 . label var time "survival time (days)"

 . label var dead "dead"

 . label var age "age"

 . label var treat "1=single 2=combined"

 . 
 . gen c = treat-1

 . compress
 patient was float now byte
 time was float now int
 dead was float now byte
 treat was float now byte
 age was float now byte
 c was float now byte

 . 
 . expand time
 (15562 observations created)
 
 . sort patient

 . quietly by patient: gen t = _n

 . gen outcome = 0 

 . qui by patient: replace outcome = dead if _n==_N

 . 
 . stset t, failure(outcome) id(patient)

 !--  note:  making entry-time variable t0
        (within patient, t0 will be 0 for the 1st observation and the
        lagged value of exit time t thereafter)
 
               id:  patient
       entry time:  t0
        exit time:  t
   failure/censor:  outcome -->
 (output omitted)

 . 
 . gen ctime = c*(t-470)
 
 . stcox age c ctime, nohr

     failure time:  t
       entry time:  t0
   failure/censor:  outcome
               id:  patient

 Iteration 0:  log likelihood = -34.98494
 Iteration 1:  log likelihood =-27.377424
 Iteration 2:  log likelihood =-26.829243
 Iteration 3:  log likelihood =-26.819411
 Iteration 4:  log likelihood =-26.819386
 Refining estimates:
 Iteration 0:  log likelihood =-26.819386
 
 Cox regression -- Breslow method for ties

 No. of subjects =           26                    Number of obs   =       403
 No. of failures =           12                
 Time at risk    =        15588                   
                                                   LR chi2(3)      =     16.33
 Log likelihood  =   -26.819386                    Prob > chi2     =    0.0010
 
 -----------------------------------------------------------------------------
        t |
  outcome |      Coef.   Std. Err.       z     P>|z|      [95% Conf. Interval]
 ---------+-------------------------------------------------------------------
      age |    .136387   .0463484      2.943   0.003      .0455457    .2272282
        c |  -.5319678   .7687557     -0.692   0.489     -2.038701    .9747657
    ctime |   .0034117     .00513      0.665   0.506     -.0066429    .0134663
 -----------------------------------------------------------------------------

This is the result Collett reports.

Understand what I did. Collett wanted the continuous variable c*(time-470), and I substituted a step function for it:

           c*(time-470), c=1
    |            /
    |           /  |  <-- my step function
    |          /   |
    |         /----+
    |        /| 
    |       / |
    |      /  |
    |     /   |
    |    /----+
    |   /|
    |  / |
    | /  |
    |/   |
    +-------------------------------- time
         1    2    3

That is what I meant when I said I could come arbitrarily close to the Collett’s result—it is just a matter of how finely I mesh my steps.

— Bill
[email protected]

Here is the dictionary containing the data used above:

 dictionary {
  patient
  time
  dead
  treat
  age
 }
 1 156 1 1 66
 2 1040 0 1 38
 3 59 1 1 72
 4 421 0 2 53
 5 329 1 1 43
 6 769 0 2 59
 7 365 1 2 64
 8 770 0 2 57 
 9 1227 0 2 59
 10 268 1 1 74
 11 475 1 2 59
 12 1129 0 2 53
 13 464 1 2 56
 14 1206 0 2 44
 15 638 1 1 56
 16 563 1 2 55
 17 1106 0 1 44
 18 431 1 1 50 
 19 855 0 1 43
 20 803 0 1 39 
 21 115 1 1 74
 22 744 0 2 50 
 23 477 0 1 64 
 24 448 0 1 56
 25 353 1 2 63 
 26 377 0 2 58