# Re: st: clustering in proportional hazards models with stata/mp 10.0

From: "Daniel O. Koralek"
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: clustering in proportional hazards models with stata/mp 10.0
Date: Fri, 7 Sep 2007 14:50:21 -0400

Bill,

Thank you for your quite thorough response to my query. One quick question. Stata will allow me to combine your solutions 1 and 3, and it does what I would expect (or at least things move in the directions I would expect): narrower confidence intervals for the estimates derived using solution 3. But something about it doesn't seem valid. Any thoughts?

Thanks,

Dan

Daniel O. Koralek
Department of Epidemiology/Lineberger Comprehensive Cancer Center
The University of North Carolina at Chapel Hill
Chapel Hill, NC 27599-7435
http://www.unc.edu/~dkoralek/

On 7 Sep 2007, at 02:33, statalist-digest wrote:

Date: Thu, 06 Sep 2007 09:38:24 -0500
From: wgould@stata.com (William Gould, StataCorp LP)
Subject: Re: st: clustering in proportional hazards models with stata/mp 10.0

Daniel Koralek <dkoralek@unc.edu> writes about using -stcox- on individual
data where each individual was recruited from one of ten centers. He is
concerned that which center may influence survival because "different foods
eaten in different regions may influence nutrients".

He considers three ways of dealing with this problem,

. stcox ..., vce(cluster center) (1)

. xi: stcox ... i.center (2)

. stcox ..., stratify(center) (3)

and, of course, he could ignore center altogether

. stcox ... [center completely omitted] (0)

As a matter of notation, let's assume the other covariates in the
models (the ... part) are x1 and x2.

Re solution (0):

This solution assumes center has no effect and Daniel has already
raised concerns that it does, so the solution is inappropriate.

Re solution (1):

This solution also assumes center has no effect; it instead
conservatively handles the situation where the individual patients
are overly homogeneous, which is to say, not independent draws.
Actually, I didn't say that exactly right for the Cox model, but
what I said implies what I should have said, which is that the
selections of the failures from the risk pools at each failure time
are not independent.

Daniel tried solution (1) and found that the standard errors changed,
but the reported coefficients did not. Exactly. Under solution (1),
because center has no effect, the coefficients estimated the standard
way are fine, although perhaps inefficient. The lack of independence,
however, means standard errors usually will be understated and
-vce(cluster center)- handles that.
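
As a rough illustration of that point, a sketch along the following lines should show identical coefficients with different standard errors. It assumes the data have already been -stset- and uses the placeholder names x1, x2, and center from the notation above; the stored-estimate names m0 and m1 are made up for the example.

```stata
* Assumes the data are already -stset-; x1, x2, and center are the
* placeholder variable names used in this thread.
stcox x1 x2, nohr                      // solution (0): center ignored
estimates store m0                     // m0 is an illustrative name

stcox x1 x2, nohr vce(cluster center)  // solution (1): cluster-robust SEs
estimates store m1                     // m1 is an illustrative name

* Side-by-side coefficients and standard errors for the two fits
estimates table m0 m1, b se
```

The -nohr- option is used only so that the table shows coefficients rather than hazard ratios.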

Re solution (2):

This solution assumes that center does have a direct effect on
survival, and it constrains the effect to be a multiplicative
shift in the baseline hazard function. The baseline hazard
function ho(t) is a function of time, such as

[ASCII plot: an illustrative baseline hazard curve, ho(t) on the vertical axis against time on the horizontal axis]

FYI, the baseline survival function So(t) is the integral of
ho(t), negated and exponentiated. There's nothing deep there;
that's just the mathematical formula for calculating one
from the other. I switched to hazard functions, however,
because the hazard function is the natural metric for the Cox model.
The hazard rate for a particular individual in the data at a particular
time is just ho(t)*exp(X_i*b), where X_i are the individual's covariates
at time t. That's why I said solution (2) constrains each center's
effect to be a multiplicative shift of ho(t).
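
Written out, the relationships in the paragraph above look roughly as follows. The symbols $\gamma_c$ (the coefficient on the dummy for center $c$) and $c(i)$ (the center of individual $i$) are notation introduced here for illustration and do not appear in the original message.

```latex
% Hazard for individual i, and the baseline survival function
h_i(t) = h_0(t)\,\exp(X_i b),
\qquad
S_0(t) = \exp\!\left(-\int_0^t h_0(u)\,du\right)

% Solution (2): each center multiplies the common baseline hazard
% by a constant exp(gamma_c), with gamma = 0 for the reference center
h_i(t) = h_0(t)\,\exp\!\left(\gamma_{c(i)}\right)\exp\!\left(b_1 x_{1i} + b_2 x_{2i}\right)
```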

Concerning our use of dummy variables for the centers,
we would like to think that we chose this particular functional form
because it is truly representative of how the different
foods served in the different centers influence the hazard, but
the fact is that we choose this functional form because it is
convenient; the effect of each center is wrapped up in just a
single coefficient.

This is not a bad approach.

Re solution (2.5):

Alright, I admit that Daniel did not include a solution (2.5), but
I want to add it; it will help to understand solution (2), and
is often useful in and of itself.

Solution (2) was

. xi: stcox ... i.center (2)

Solution 2.5 is

. xi: stcox ... i.center i.center*x1 (2.5)

In this solution, we assume that center does not merely shift
the hazard function in a multiplicative way, we assume that
center modifies the effect of x1.

Actually, there are a lot of solution (2.5)'s. I could have chosen
x2 rather than x1,

. xi: stcox ... i.center i.center*x2

or even x1 and x2,

. xi: stcox ... i.center i.center*x1 i.center*x2

Anyway, in this modeling-based approach, we need to think carefully
about how the different foods served in the centers affect the shifting
of the baseline hazard function. Is it just a shift (solution 2),
or do the different foods modify the effect of x1 (solution 2.5), or
something else?

We also need to appreciate that we are assuming the SHAPE of the
baseline hazard function is the same across all centers and that we are
just moving it up and down, multiplicatively.
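
One possible way to probe the choice between solution (2) and solution (2.5) is a likelihood-ratio comparison of the two fits. The sketch below assumes -stset- data and the placeholder names x1, x2, and center; the names m2 and m25 are invented for the example, and the test is only meaningful for fits without -vce(cluster)-.

```stata
* Assumes the data are already -stset-; x1, x2, center are placeholders.
xi: stcox x1 x2 i.center          // solution (2): center shifts only
estimates store m2                // m2 is an illustrative name

xi: stcox x2 i.center*x1          // solution (2.5): center also modifies x1
estimates store m25               // m25 is an illustrative name

* Likelihood-ratio test of the center-by-x1 interaction terms
lrtest m2 m25
```

Here -i.center*x1- already expands to the center dummies, x1, and their interactions, so x1 and i.center are not listed separately in the second model.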

Re solution (3):

In this solution, we let the baseline hazard be different for each
center. That is, rather than assuming the baseline function is

[ASCII plot: the same illustrative baseline hazard curve as before, ho(t) against time]

for all centers, albeit shifted, we assume that above picture might
be the baseline function for center 1, and for center 2, the function
could be completely different:

[ASCII plot: a differently shaped illustrative baseline hazard curve, ho(t) against time]

and it could be different again for each of the other centers.

I should emphasize that we do not actually assume the shape --
the data determine that -- we just ALLOW the shape to be different
in this solution. In the previous solution, we CONSTRAINED the
shape to be the same across centers, but what that single shape was
was determined by the data.

Anyway, this new solution seems wonderful because, what could be more
flexible?

In this solution, however, we constrain the effects of x1 and x2
to be the same across centers. If the estimated hazard ratio
for x1 is 1.5, we are saying that each center's hazard
function -- yes, they are different -- is multiplied by THE SAME
1.5 for each unit increase in x1. The multiplicative shift is the
same, but the underlying hazard functions are different.
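
In symbols, the stratified model of solution (3) can be sketched as below, where $h_{0c}(t)$ (the baseline hazard of center $c$) and $c(i)$ (the center of individual $i$) are notation assumed here for illustration.

```latex
% Solution (3): a separate baseline hazard per center,
% with common coefficients b_1 and b_2
h_i(t) = h_{0,c(i)}(t)\,\exp\!\left(b_1 x_{1i} + b_2 x_{2i}\right)

% A hazard ratio of 1.5 for x1 corresponds to exp(b_1) = 1.5 in every center
\frac{h_{0c}(t)\,\exp\!\left(b_1 (x_1 + 1) + b_2 x_2\right)}
     {h_{0c}(t)\,\exp\!\left(b_1 x_1 + b_2 x_2\right)} = \exp(b_1)
\quad \text{for all } c
```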

Re solution 3.5:

Daniel didn't mention this solution, but what if he combined
solution (3) with solution (2), which would be

. xi: stcox ... i.center, stratify(center)

Answer: nothing new; the result is just solution (3).
-stratify(center)- already allows the baseline hazards to be
different, and that includes multiplicative shifts.
i.center would try to estimate a unique shift to apply to each
unique baseline hazard for each center, and mechanically, that
will not work because for any value of the shift, there is a
corresponding baseline hazard that, when you combine the results,
yields the same final result. Try this, and -stcox- will iterate
forever.
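
The identification problem can be seen in a short derivation. The notation $\gamma_c$ for a center coefficient, $h_{0c}(t)$ for a center's baseline hazard, and $R_c(t)$ for the risk pool within center $c$ at time $t$ is assumed here for illustration.

```latex
% Any constant delta can be moved between the baseline and the center shift
h_{0c}(t)\,e^{\gamma_c}
  = \left[h_{0c}(t)\,e^{-\delta}\right] e^{\gamma_c + \delta}
  \quad \text{for every } \delta

% Within a stratum, the center term cancels from each
% partial-likelihood contribution
\frac{e^{\gamma_c + X_i b}}{\sum_{j \in R_c(t)} e^{\gamma_c + X_j b}}
  = \frac{e^{X_i b}}{\sum_{j \in R_c(t)} e^{X_j b}}
```

Because the second expression does not depend on $\gamma_c$, the stratified partial likelihood carries no information about it.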

Nonetheless, there is a variation on the above that will work.
One example is

. xi: stcox ... i.center*x1, stratify(center)

There is no i.center in the above -- stratify(center) handles that --
but we do allow the effect of x1 to be different across the
centers.

Given sufficient data, Daniel could do this if he thought center
affected both the shape of the baseline hazard function and
affected the effect of x1.
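
A hedged sketch of how such a model might be typed with the placeholder variables follows. Two assumptions are worth flagging: -stcox- documents its stratification option as -strata()-, and -xi- documents the form i.center|x1, which creates x1 and the center-by-x1 interactions without the center main effects (the form i.center*x1 also generates the center dummies, which are the terms stratification already absorbs).

```stata
* Assumes the data are already -stset-; x1, x2, center are placeholders.
* Per-center effect of x1 on top of per-center baseline hazards:
xi: stcox x2 i.center|x1, strata(center)
```

The coefficient on x1 is then the effect in the reference center, and each interaction coefficient is the difference from it for another center.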

Final comment
-------------

Daniel must now choose, and he needs to base his choice on his science,
judgment guided by experience, and whatever else he has that will inform him
as to how the process that generates failures might reasonably work.

Daniel might object that he wants to make the minimum number of assumptions
necessary. In that case, I would recommend solution (3.5), but I
warn him, he may not have a sufficient amount of data for it. Given
sufficient data, Daniel could start with a solution (3.5) model and then work
backwards, putting constraints on it that appear reasonable, the purpose being
to simplify interpretation.

Usually, however, we are not so lucky as to have sufficient data to do
that, and then we must think hard about what is a reasonable starting place,
and go at it from there. A reasonable starting place might well be
the dummy-variable shifts of solution (2), about which Daniel was
so dismissive. Identifying shifts is a lot like measuring averages.
It doesn't give you the richness of detail of more complete models, but
it can be a good starting point for identifying what is going on.

One more warning about solution (3.5): It is not a panacea. It, too, makes
assumptions, such as multiplicative effects on hazard functions and that
the functional form chosen by Daniel is correct. Given even more data,
we could explore the validity of those assumptions, too. My point is that
after solution 3.5, there are solutions 4, 5, 6, and on and on, each making
fewer and fewer assumptions, and each requiring more and more data.

-- Bill
wgould@stata.com
