»  Home »  Resources & support »  FAQs »  Stata 6: stcox with time-varying covariate
Note: This FAQ is for users of Stata 6. It is not relevant for more recent versions.

## Stata 6: How do I estimate a Cox model with a continuously time-varying parameter?

 Title Stata 6: Estimating a Cox model with a continuously time-varying parameter Author William Gould, StataCorp

This question was originally posed on Statalist and answered by several users and StataCorp’s Bill Gould. Bill's answer, with slight editing, appears below. The thread began when one Stata user was trying to reproduce a Cox regression example on pages 195–197 of Modelling Survival Data in Medical Research by D. Collett (Chapman & Hall 1991).

Below is the response written by Bill Gould, which we quote in full.

Here are a little bit of data in which we want to investigate a continuously time varying Cox-regression.

  . describe

Contains data
obs:            26
vars:             5
size:           624 (99.9% of memory free)
-----------------------------------------------------------------------------
1. patient     float  %9.0g
2. time        float  %9.0g                  survival time (days)
4. treat       float  %9.0g                  1=single 2=combined
5. age         float  %9.0g                  age
-----------------------------------------------------------------------------
Sorted by:
Note:  data has changed since last save

. list in 1/5

1.         1        156          1          1         66
2.         2       1040          0          1         38
3.         3         59          1          1         72
4.         4        421          0          2         53
5.         5        329          1          1         43


and here is what we are to do:

1. Create new variable c = treat-1.
2. Estimate a Cox survival model on c, age, and c*time, where time = (time-470).

Finally, the book reports

h(t) = exp{.136*age - .532*c + .003*c*time} * h0(t)

The problem is that this regression includes the (continously varying) time-varying regressor c*time.

This is indeed a tricky problem for Stata. Stata will estimate time-varying models, but Stata estimates models in which the time-varying regressors are assumed to be constant within intervals. Thus the formal answer to our question is that Stata cannot estimate the model. I can, however, come arbitrarily close. I can certainly reproduce these results.

Let’s back up, and let me show you how to do that.

In Stata, when you want to estimate a regression with time-varying covariates, there are to be multiple observations in the dataset per patient. Let us consider the first patient.

           patient       time       dead      treat        age
1.         1        156          1          1         66


As the dataset is right now, this single observation records all the information on the patient. It says that the patient died on day 156. Now pretend the data on patient 1 looked like this,

           patient       time       dead      treat        age
1.         1          1          0          1         66
2.         1        156          1          1         66


There are now two observations on this patient. The two observations summarize the experiences of the patient over the time intervals (0,1] and (1,156]. Note that I changed dead=0 in the first observation, so the patient did not die at time 1. As I have written it, these two observations for patient 1 record exactly the same information as the single observation did earlier. So do these three observations record the same information,

           patient       time       dead      treat        age
1.         1          1          0          1         66
2.         1          2          0          1         66
3.         1        156          1          1         66


and so do these 156 observations,

           patient       time       dead      treat        age
1.         1          1          0          1         66
2.         1          2          0          1         66
3.         1          3          0          1         66
...
155.         1        155          0          1         66
156.         1        156          1          1         66


Note that in the last form I have one observation per day. I could add a new variable to this dataset equal to c*(time-470). In fact, I have done this in Stata. Here is my do-file:

 clear
infile using cancer
label var time "survival time (days)"
label var age "age"
label var treat "1=single 2=combined"

gen c = treat-1
compress

expand time
sort patient
qui by patient: gen t = _n
gen outcome = 0
quietly by patient: replace outcome = dead if _n==_N

stset t, failure(outcome) id(patient)

gen ctime = c*(t-470)
stcox age c ctime, nohr


Here is an explanation of what I am doing:

    clear
infile using cancer
label var time "survival time (days)"
label var age "age"
label var treat "1=single 2=combined"


I am just reading and labelling the data I showed you earlier. For those who want to duplicate my results, the dictionary cancer.dct can be found below my signature.

    gen c = treat-1
compress


I create the c variable as instructed and toss in a compress command to make the dataset occupy as little memory as possible. The compress plays no substantive role, and I could omit it.

    expand time


This is the important statement in the do-file. expand time means, out of each observation, make time observations (such as 156 observations) from it. Said differently, I have asked Stata to make time-1 copies of each observation. Thus, after this statement, my first observation (which had time=156) is now 156 identical observations:

    patient       time       dead      treat        age
1          1          0          1         66
1          1          0          1         66
...
1          1          0          1         66


The observations are not one after the other, however. All the new observations have been added after the original 26 observations.

    sort patient


Now the observations are in patient order.

    quietly by patient: gen t = _n


I create a new time variable called t equal to 1, 2, 3, ..., within patient.

    gen outcome = 0
quietly by patient: replace outcome = dead if _n==_N


I create a new outcome variable equal to 0 and then, within patient, set its last observation to the value of the original dead variable.

My data is now just how I want it, and I’m ready to analyze it.

    stset t, failure(outcome) id(patient)


I declare my data to be st data with multiple observations per patient.

    gen ctime = c*(t-470)


I create the c*time variable. I could have done this before the stset. The order does not matter.

    stcox age c ctime, nohr


I estimate the regression. The nohr option means no hazard ratios. I prefer to see results as hazard ratios, but the results to which we are comparing are reported as coefficients, so I've asked Stata to display them that way.

Here is the result of running my do-file:

 . do cancer

. clear

. infile using cancer

dictionary {
patient
time
treat
age
}

. label var time "survival time (days)"

. label var age "age"

. label var treat "1=single 2=combined"

.
. gen c = treat-1

. compress
patient was float now byte
time was float now int
treat was float now byte
age was float now byte
c was float now byte

.
. expand time
(15562 observations created)

. sort patient

. quietly by patient: gen t = _n

. gen outcome = 0

. qui by patient: replace outcome = dead if _n==_N

.
. stset t, failure(outcome) id(patient)

!--  note:  making entry-time variable t0
(within patient, t0 will be 0 for the 1st observation and the
lagged value of exit time t thereafter)

id:  patient
entry time:  t0
exit time:  t
failure/censor:  outcome -->
(output omitted)

.
. gen ctime = c*(t-470)

. stcox age c ctime, nohr

failure time:  t
entry time:  t0
failure/censor:  outcome
id:  patient

Iteration 0:  log likelihood = -34.98494
Iteration 1:  log likelihood =-27.377424
Iteration 2:  log likelihood =-26.829243
Iteration 3:  log likelihood =-26.819411
Iteration 4:  log likelihood =-26.819386
Refining estimates:
Iteration 0:  log likelihood =-26.819386

Cox regression -- Breslow method for ties

No. of subjects =           26                    Number of obs   =       403
No. of failures =           12
Time at risk    =        15588
LR chi2(3)      =     16.33
Log likelihood  =   -26.819386                    Prob > chi2     =    0.0010

-----------------------------------------------------------------------------
t |
outcome |      Coef.   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+-------------------------------------------------------------------
age |    .136387   .0463484      2.943   0.003      .0455457    .2272282
c |  -.5319678   .7687557     -0.692   0.489     -2.038701    .9747657
ctime |   .0034117     .00513      0.665   0.506     -.0066429    .0134663
-----------------------------------------------------------------------------


This is the result Collett reports.

Understand what I did. Collett wanted the continuous variable c*(time-470), and I substituted a step function for it:

           c*(time-470), c=1
|            /
|           /  |  <-- my step function
|          /   |
|         /----+
|        /|
|       / |
|      /  |
|     /   |
|    /----+
|   /|
|  / |
| /  |
|/   |
+-------------------------------- time
1    2    3


That is what I meant when I said I could come arbitrarily close to the Collett’s result—it is just a matter of how finely I mesh my steps.

— Bill
wgould@stata.com

Here is the dictionary containing the data used above:

 dictionary {
patient
time
treat
age
}
1 156 1 1 66
2 1040 0 1 38
3 59 1 1 72
4 421 0 2 53
5 329 1 1 43
6 769 0 2 59
7 365 1 2 64
8 770 0 2 57
9 1227 0 2 59
10 268 1 1 74
11 475 1 2 59
12 1129 0 2 53
13 464 1 2 56
14 1206 0 2 44
15 638 1 1 56
16 563 1 2 55
17 1106 0 1 44
18 431 1 1 50
19 855 0 1 43
20 803 0 1 39
21 115 1 1 74
22 744 0 2 50
23 477 0 1 64
24 448 0 1 56
25 353 1 2 63
26 377 0 2 58