Note: This FAQ is for users of Stata 6, an older version of Stata.
It is not relevant for more recent versions.

Title | Stata 6: Estimating a Cox model with a continuously time-varying parameter | |

Author | William Gould, StataCorp |

This question was originally posed on
Statalist and answered by
several users and StataCorp’s Bill Gould. Bill's answer, with slight
editing, appears below. The thread
began when one Stata user was trying to reproduce a Cox regression example
on pages 195–197 of *Modelling Survival Data in Medical Research* by
D. Collett (Chapman & Hall 1991).

Below is the response written by Bill Gould, which we quote in full.

Here are a little bit of data in which we want to investigate a continuously time varying Cox-regression.

. describeContains data obs: 26 vars: 5 size: 624 (99.9% of memory free) ----------------------------------------------------------------------------- 1. patient float %9.0g 2. time float %9.0g survival time (days) 3. dead float %9.0g dead 4. treat float %9.0g 1=single 2=combined 5. age float %9.0g age ----------------------------------------------------------------------------- Sorted by: Note: data has changed since last save. list in 1/5patient time dead treat age 1. 1 156 1 1 66 2. 2 1040 0 1 38 3. 3 59 1 1 72 4. 4 421 0 2 53 5. 5 329 1 1 43

and here is what we are to do:

- Create new variable
**c = treat-1**. - Estimate a Cox survival model on
**c**,**age**, and**c*time**, where**time = (time-470)**.

Finally, the book reports

**h(t) = exp{.136*age - .532*c + .003*c*time} * h0(t)**

The problem is that this regression includes the (continously varying)
time-varying regressor **c*time**.

This is indeed a tricky problem for Stata. Stata will estimate time-varying models, but Stata estimates models in which the time-varying regressors are assumed to be constant within intervals. Thus the formal answer to our question is that Stata cannot estimate the model. I can, however, come arbitrarily close. I can certainly reproduce these results.

Let’s back up, and let me show you how to do that.

In Stata, when you want to estimate a regression with time-varying covariates, there are to be multiple observations in the dataset per patient. Let us consider the first patient.

patient time dead treat age 1. 1 156 1 1 66

As the dataset is right now, this single observation records all the information on the patient. It says that the patient died on day 156. Now pretend the data on patient 1 looked like this,

patient time dead treat age 1. 1 1 0 1 66 2. 1 156 1 1 66

There are now two observations on this patient. The two observations
summarize the experiences of the patient over the time intervals (0,1] and
(1,156]. Note that I changed **dead=0** in the first observation, so the
patient did not die at time 1. As I have written it, these two observations
for patient 1 record exactly the same information as the single observation
did earlier. So do these three observations record the same information,

patient time dead treat age 1. 1 1 0 1 66 2. 1 2 0 1 66 3. 1 156 1 1 66

and so do these 156 observations,

patient time dead treat age 1. 1 1 0 1 66 2. 1 2 0 1 66 3. 1 3 0 1 66 ... 155. 1 155 0 1 66 156. 1 156 1 1 66

Note that in the last form I have one observation per day. I could add a
new variable to this dataset equal to **c*(time-470)**. In fact, I have
done this in Stata. Here is my do-file:

clear infile using cancer label var time "survival time (days)" label var dead "dead" label var age "age" label var treat "1=single 2=combined" gen c = treat-1 compress expand time sort patient qui by patient: gen t = _n gen outcome = 0 quietly by patient: replace outcome = dead if _n==_N stset t, failure(outcome) id(patient) gen ctime = c*(t-470) stcox age c ctime, nohr

Here is an explanation of what I am doing:

clear infile using cancer label var time "survival time (days)" label var dead "dead" label var age "age" label var treat "1=single 2=combined"

I am just reading and labelling the data I showed you earlier. For those who want to duplicate my results, the dictionary cancer.dct can be found below my signature.

gen c = treat-1 compress

I create the c variable as instructed and toss in a
**compress**
command to make the dataset occupy as little memory as possible.
The **compress** plays no substantive role, and I could omit it.

expand time

This is the important statement in the do-file. **expand time** means,
out of each observation, make time observations (such as 156
observations) from it. Said differently, I have asked Stata to make
time-1 copies of each observation. Thus, after this statement, my
first observation (which had time=156) is now 156 identical
observations:

patient time dead treat age 1 1 0 1 66 1 1 0 1 66 ... 1 1 0 1 66

The observations are not one after the other, however. All the new observations have been added after the original 26 observations.

sort patient

Now the observations are in patient order.

quietly by patient: gen t = _n

I create a new time variable called t equal to 1, 2, 3, ..., within patient.

gen outcome = 0 quietly by patient: replace outcome = dead if _n==_N

I create a new outcome variable equal to 0 and then, within patient, set its last observation to the value of the original dead variable.

My data is now just how I want it, and I’m ready to analyze it.

stset t, failure(outcome) id(patient)

I declare my data to be st data with multiple observations per patient.

gen ctime = c*(t-470)

I create the c*time variable. I could have done this before the
**stset**.
The order does not matter.

stcox age c ctime, nohr

I estimate the regression. The **nohr** option means no hazard
ratios. I prefer to see results as hazard ratios, but the results
to which we are comparing are reported as coefficients, so I've
asked Stata to display them that way.

Here is the result of running my do-file:

. do cancer . clear . infile using cancerdictionary { patient time dead treat age } (26 observations read). label var time "survival time (days)" . label var dead "dead" . label var age "age" . label var treat "1=single 2=combined" . . gen c = treat-1 . compresspatient was float now byte time was float now int dead was float now byte treat was float now byte age was float now byte c was float now byte. . expand time(15562 observations created). sort patient . quietly by patient: gen t = _n . gen outcome = 0 . qui by patient: replace outcome = dead if _n==_N . . stset t, failure(outcome) id(patient)!-- note: making entry-time variable t0 (within patient, t0 will be 0 for the 1st observation and the lagged value of exit time t thereafter) id: patient entry time: t0 exit time: t failure/censor: outcome --> (output omitted). . gen ctime = c*(t-470) . stcox age c ctime, nohrfailure time: t entry time: t0 failure/censor: outcome id: patient Iteration 0: log likelihood = -34.98494 Iteration 1: log likelihood =-27.377424 Iteration 2: log likelihood =-26.829243 Iteration 3: log likelihood =-26.819411 Iteration 4: log likelihood =-26.819386 Refining estimates: Iteration 0: log likelihood =-26.819386 Cox regression -- Breslow method for ties No. of subjects = 26 Number of obs = 403 No. of failures = 12 Time at risk = 15588 LR chi2(3) = 16.33 Log likelihood = -26.819386 Prob > chi2 = 0.0010 ----------------------------------------------------------------------------- t | outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+------------------------------------------------------------------- age | .136387 .0463484 2.943 0.003 .0455457 .2272282 c | -.5319678 .7687557 -0.692 0.489 -2.038701 .9747657 ctime | .0034117 .00513 0.665 0.506 -.0066429 .0134663 -----------------------------------------------------------------------------

This is the result Collett reports.

Understand what I did. Collett wanted the continuous variable
**c*(time-470)**, and I substituted a step function for it:

c*(time-470), c=1 | / | / | <-- my step function | / | | /----+ | /| | / | | / | | / | | /----+ | /| | / | | / | |/ | +-------------------------------- time 1 2 3

That is what I meant when I said I could come arbitrarily close to the Collett’s result—it is just a matter of how finely I mesh my steps.

— Bill

wgould@stata.com

Here is the dictionary containing the data used above:

dictionary { patient time dead treat age } 1 156 1 1 66 2 1040 0 1 38 3 59 1 1 72 4 421 0 2 53 5 329 1 1 43 6 769 0 2 59 7 365 1 2 64 8 770 0 2 57 9 1227 0 2 59 10 268 1 1 74 11 475 1 2 59 12 1129 0 2 53 13 464 1 2 56 14 1206 0 2 44 15 638 1 1 56 16 563 1 2 55 17 1106 0 1 44 18 431 1 1 50 19 855 0 1 43 20 803 0 1 39 21 115 1 1 74 22 744 0 2 50 23 477 0 1 64 24 448 0 1 56 25 353 1 2 63 26 377 0 2 58