|
Note: This FAQ is for users of Stata 6, an older version of Stata.
It is not relevant for more recent versions.
Stata 6: How do I estimate a Cox model with a continuously time-varying parameter?
|
Title
|
|
Stata 6: Estimating a Cox model with a continuously time-varying parameter
|
|
Author
|
William Gould, StataCorp
|
|
Date
|
January 1999
|
This question was originally posed on
Statalist and answered by
several users and StataCorp’s Bill Gould. Bill's answer, with slight
editing, appears below. The thread
began when one Stata user was trying to reproduce a Cox regression example
on pages 195–197 of Modelling Survival Data in Medical Research by
D. Collett (Chapman & Hall 1991).
Below is the response written by Bill Gould, which we quote in full.
Here are a little bit of data in which we want to investigate a continuously
time varying Cox-regression.
. describe
Contains data
obs: 26
vars: 5
size: 624 (99.9% of memory free)
-----------------------------------------------------------------------------
1. patient float %9.0g
2. time float %9.0g survival time (days)
3. dead float %9.0g dead
4. treat float %9.0g 1=single 2=combined
5. age float %9.0g age
-----------------------------------------------------------------------------
Sorted by:
Note: data has changed since last save
. list in 1/5
patient time dead treat age
1. 1 156 1 1 66
2. 2 1040 0 1 38
3. 3 59 1 1 72
4. 4 421 0 2 53
5. 5 329 1 1 43
and here is what we are to do:
- Create new variable c = treat-1.
- Estimate a Cox survival model on c, age, and c*time,
where time = (time-470).
Finally, the book reports
-
h(t) = exp{.136*age - .532*c + .003*c*time} * h0(t)
The problem is that this regression includes the (continously varying)
time-varying regressor c*time.
This is indeed a tricky problem for Stata. Stata will estimate
time-varying models, but Stata estimates models in which the time-varying
regressors are assumed to be constant within intervals. Thus the formal
answer to our question is that Stata cannot estimate the model. I can,
however, come arbitrarily close. I can certainly reproduce these results.
Let’s back up, and let me show you how to do that.
In Stata, when you want to estimate a regression with time-varying
covariates, there are to be multiple observations in the dataset per
patient. Let us consider the first patient.
patient time dead treat age
1. 1 156 1 1 66
As the dataset is right now, this single observation records all the
information on the patient. It says that the patient died on day 156. Now
pretend the data on patient 1 looked like this,
patient time dead treat age
1. 1 1 0 1 66
2. 1 156 1 1 66
There are now two observations on this patient. The two observations
summarize the experiences of the patient over the time intervals (0,1] and
(1,156]. Note that I changed dead=0 in the first observation, so the
patient did not die at time 1. As I have written it, these two observations
for patient 1 record exactly the same information as the single observation
did earlier. So do these three observations record the same information,
patient time dead treat age
1. 1 1 0 1 66
2. 1 2 0 1 66
3. 1 156 1 1 66
and so do these 156 observations,
patient time dead treat age
1. 1 1 0 1 66
2. 1 2 0 1 66
3. 1 3 0 1 66
...
155. 1 155 0 1 66
156. 1 156 1 1 66
Note that in the last form I have one observation per day. I could add a
new variable to this dataset equal to c*(time-470). In fact, I have
done this in Stata. Here is my do-file:
clear
infile using cancer
label var time "survival time (days)"
label var dead "dead"
label var age "age"
label var treat "1=single 2=combined"
gen c = treat-1
compress
expand time
sort patient
qui by patient: gen t = _n
gen outcome = 0
quietly by patient: replace outcome = dead if _n==_N
stset t, failure(outcome) id(patient)
gen ctime = c*(t-470)
stcox age c ctime, nohr
Here is an explanation of what I am doing:
clear
infile using cancer
label var time "survival time (days)"
label var dead "dead"
label var age "age"
label var treat "1=single 2=combined"
I am just reading and labelling the data I showed you earlier. For
those who want to duplicate my results, the dictionary
cancer.dct can be found below my signature.
gen c = treat-1
compress
I create the c variable as instructed and toss in a
compress
command to make the dataset occupy as little memory as possible.
The compress plays no substantive role, and I could omit it.
expand time
This is the important statement in the do-file. expand time means,
out of each observation, make time observations (such as 156
observations) from it. Said differently, I have asked Stata to make
time-1 copies of each observation. Thus, after this statement, my
first observation (which had time=156) is now 156 identical
observations:
patient time dead treat age
1 1 0 1 66
1 1 0 1 66
...
1 1 0 1 66
The observations are not one after the other, however. All the new
observations have been added after the original 26 observations.
sort patient
Now the observations are in patient order.
quietly by patient: gen t = _n
I create a new time variable called t equal to 1, 2, 3, ..., within
patient.
gen outcome = 0
quietly by patient: replace outcome = dead if _n==_N
I create a new outcome variable equal to 0 and then, within patient,
set its last observation to the value of the original dead variable.
My data is now just how I want it, and I’m ready to analyze it.
stset t, failure(outcome) id(patient)
I declare my data to be st data with multiple observations per
patient.
gen ctime = c*(t-470)
I create the c*time variable. I could have done this before the
stset.
The order does not matter.
stcox age c ctime, nohr
I estimate the regression. The nohr option means no hazard
ratios. I prefer to see results as hazard ratios, but the results
to which we are comparing are reported as coefficients, so I've
asked Stata to display them that way.
Here is the result of running my do-file:
. do cancer
. clear
. infile using cancer
dictionary {
patient
time
dead
treat
age
}
(26 observations read)
. label var time "survival time (days)"
. label var dead "dead"
. label var age "age"
. label var treat "1=single 2=combined"
.
. gen c = treat-1
. compress
patient was float now byte
time was float now int
dead was float now byte
treat was float now byte
age was float now byte
c was float now byte
.
. expand time
(15562 observations created)
. sort patient
. quietly by patient: gen t = _n
. gen outcome = 0
. qui by patient: replace outcome = dead if _n==_N
.
. stset t, failure(outcome) id(patient)
!-- note: making entry-time variable t0
(within patient, t0 will be 0 for the 1st observation and the
lagged value of exit time t thereafter)
id: patient
entry time: t0
exit time: t
failure/censor: outcome -->
(output omitted)
.
. gen ctime = c*(t-470)
. stcox age c ctime, nohr
failure time: t
entry time: t0
failure/censor: outcome
id: patient
Iteration 0: log likelihood = -34.98494
Iteration 1: log likelihood =-27.377424
Iteration 2: log likelihood =-26.829243
Iteration 3: log likelihood =-26.819411
Iteration 4: log likelihood =-26.819386
Refining estimates:
Iteration 0: log likelihood =-26.819386
Cox regression -- Breslow method for ties
No. of subjects = 26 Number of obs = 403
No. of failures = 12
Time at risk = 15588
LR chi2(3) = 16.33
Log likelihood = -26.819386 Prob > chi2 = 0.0010
-----------------------------------------------------------------------------
t |
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+-------------------------------------------------------------------
age | .136387 .0463484 2.943 0.003 .0455457 .2272282
c | -.5319678 .7687557 -0.692 0.489 -2.038701 .9747657
ctime | .0034117 .00513 0.665 0.506 -.0066429 .0134663
-----------------------------------------------------------------------------
This is the result Collett reports.
Understand what I did. Collett wanted the continuous variable
c*(time-470), and I substituted a step function for it:
c*(time-470), c=1
| /
| / | <-- my step function
| / |
| /----+
| /|
| / |
| / |
| / |
| /----+
| /|
| / |
| / |
|/ |
+-------------------------------- time
1 2 3
That is what I meant when I said I could come arbitrarily close to the
Collett’s result—it is just a matter of how finely I mesh my steps.
— Bill
wgould@stata.com
Here is the dictionary containing the data used above:
dictionary {
patient
time
dead
treat
age
}
1 156 1 1 66
2 1040 0 1 38
3 59 1 1 72
4 421 0 2 53
5 329 1 1 43
6 769 0 2 59
7 365 1 2 64
8 770 0 2 57
9 1227 0 2 59
10 268 1 1 74
11 475 1 2 59
12 1129 0 2 53
13 464 1 2 56
14 1206 0 2 44
15 638 1 1 56
16 563 1 2 55
17 1106 0 1 44
18 431 1 1 50
19 855 0 1 43
20 803 0 1 39
21 115 1 1 74
22 744 0 2 50
23 477 0 1 64
24 448 0 1 56
25 353 1 2 63
26 377 0 2 58
|