Hello,
I've got a question regarding the use of Cox regression with explanators
that vary monotonically with analysis time. I am using the National
Longitudinal Survey of Youth to look at the effect of having a prison
record on one's hazard of first marriage. If possible, I would like to
conduct some analyses for a sample that is limited to respondents who are
incarcerated at some point during the panel. But I'm getting some very
strange results when I do so. Here are the results for the entire sample
(I cleaned them up a bit to make them a little more readable). The
independent variable of interest is called "everjail," and is set equal to
zero for person-years in which the respondent has not yet gone to jail and
to one for person-years after the respondent has been incarcerated. Analysis
time is measured here in months of age. The hazard ratio on "everjail" is
less than one and
statistically significant. This result persists when I use a number of
different combinations of control variables, when I limit my sample to
self-reported juvenile delinquents, and in a number of other settings:
#delimit;
capture drop sch*;
capture drop sca*;
xi: stcox newern cumern jail everjail alcoholicparent badparent
i.stateres
delinquent1 south urate rural everkids i.year AFQT i.edcatrev relighome
if race == 2 & varuse != . & sampid < 15 & sampid != 9,
robust schoenfeld(sch*) scaledsch(sca*);
stphtest, detail;
Cox regression -- Breslow method for ties
No. of subjects =    196630900                    Number of obs   =     12493
No. of failures =     85969633
Time at risk    =   1937511024
                                                  Wald chi2(69)   =         .
Log pseudolikelihood =   -3515.0499               Prob > chi2     =         .
                            (Std. Err. adjusted for 1281 clusters in caseid)
------------------------------------------------------------------------------
             |               Robust
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      newern |    1.00058   .0000778     7.45   0.000     1.000427    1.000732
      cumern |   1.020076   .0021842     9.28   0.000     1.015804    1.024366
        jail |     .40921   .1474272    -2.48   0.013     .2019676    .8291077
    everjail |   .5976148   .1435622    -2.14   0.032     .3731996    .9569772
alcoholicp~t |   1.365126   .1719276     2.47   0.013     1.066522    1.747331
   badparent |   1.181243    .133375     1.48   0.140     .9467374    1.473836
 delinquent1 |   .9686306   .0948547    -0.33   0.745     .7994714    1.173582
       south |   7.391707   6.974655     2.12   0.034     1.162972    46.98076
       urate |   1.022676    .019999     1.15   0.252     .9842206    1.062634
       rural |   1.105722   .1783717     0.62   0.533     .8059955    1.516908
    everkids |   1.316987   .1340499     2.71   0.007     1.078802     1.60776
        AFQT |   .9999217   .0002457    -0.32   0.750     .9994402    1.000403
_Iedcatrev_2 |   1.214536   .1669344     1.41   0.157     .9277167    1.590031
_Iedcatrev_3 |   1.156426   .1918056     0.88   0.381     .8354816    1.600658
   relighome |   .7565323   .1801382    -1.17   0.241     .4744032    1.206445
------------------------------------------------------------------------------
. stphtest, detail;
Test of proportional hazards assumption
Time: Time
----------------------------------------------------------------
| rho chi2 df Prob>chi2
------------+---------------------------------------------------
newern | 0.03542 0.71 1 0.3997
cumern | 0.00619 0.03 1 0.8579
jail | 0.00360 0.01 1 0.9184
everjail | 0.06603 3.64 1 0.0565
alcoholicp~t| -0.00158 0.00 1 0.9665
badparent | 0.04819 1.56 1 0.2116
delinquent1 | -0.01933 0.31 1 0.5757
south | -0.00161 0.00 1 0.9671
urate | 0.01949 0.30 1 0.5864
rural | 0.08032 5.01 1 0.0251
everkids | -0.05329 2.42 1 0.1195
AFQT | -0.02263 0.37 1 0.5405
_Iedcatrev_2| -0.00972 0.08 1 0.7794
_Iedcatrev_3| 0.01125 0.10 1 0.7510
relighome | -0.00616 0.03 1 0.8547
------------+---------------------------------------------------
global test | 66.57 79 0.8394
----------------------------------------------------------------
note: robust variance-covariance matrix used.
Next, I limit my sample only to persons who go to prison at some point
during the panel, so the comparison group for those with a prison record at
any given point in time consists of a group that has not yet gone to prison
but will at some point in the future. Bear in mind that, in this sample,
all persons' everjail values will switch from zero to one at some point
during the panel and will then remain one for the duration of the panel.
This specification attenuates the estimated effect of everjail and the
parameter is no longer significant. However, the Schoenfeld residual
analysis also suggests that the effect of everjail is not proportionally
constant over time:
#delimit;
capture drop sch*;
capture drop sca*;
xi: stcox newern cumern jail everjail alcoholicparent badparent
i.stateres
delinquent1 south urate rural everkids i.year AFQT i.edcatrev relighome
if race == 2 & varuse != . & sampid < 15 & sampid != 9 & truejail == 1,
robust schoenfeld(sch*) scaledsch(sca*);
stphtest, detail;
Cox regression -- Breslow method for ties
No. of subjects =     37925698                    Number of obs   =      2839
No. of failures =     10391186
Time at risk    =  423487389.6
                                                  Wald chi2(65)   =         .
Log pseudolikelihood =   -301.42336               Prob > chi2     =         .
                             (Std. Err. adjusted for 259 clusters in caseid)
------------------------------------------------------------------------------
             |               Robust
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      newern |    .994557   .0145889    -0.37   0.710     .9663705    1.023566
      cumern |   1.024386   .0099382     2.48   0.013     1.005092    1.044051
        jail |   .4558718     .22238    -1.61   0.107      .175233    1.185959
    everjail |   .7078746   .2519604    -0.97   0.332     .3523548    1.422108
alcoholicp~t |   1.212183   .3920952     0.59   0.552     .6430384    2.285071
   badparent |   1.278569   .5155408     0.61   0.542     .5801032    2.818014
 delinquent1 |    .731428   .2461317    -0.93   0.353     .3782118    1.414517
       south |   2.175701   2.197618     0.77   0.442     .3004853    15.75343
       urate |   1.066808   .0537228     1.28   0.199     .9665423    1.177474
       rural |   1.429138   .7232303     0.71   0.480     .5300472    3.853307
    everkids |   1.963302   .6986453     1.90   0.058     .9774288    3.943565
        AFQT |   .9988161    .001441    -0.82   0.412     .9959957    1.001644
_Iedcatrev_2 |    1.08698   .3549441     0.26   0.798     .5731508    2.061456
_Iedcatrev_3 |   1.350611   .6145295     0.66   0.509     .5536468    3.294792
   relighome |   .6680439   .4299661    -0.63   0.531     .1892148    2.358603
------------------------------------------------------------------------------
. stphtest, detail;
Test of proportional hazards assumption
Time: Time
----------------------------------------------------------------
| rho chi2 df Prob>chi2
------------+---------------------------------------------------
newern | -0.07667 0.88 1 0.3493
cumern | 0.14270 3.68 1 0.0552
jail | 0.06943 1.33 1 0.2495
everjail | 0.15710 5.30 1 0.0214
alcoholicp~t| -0.07257 0.93 1 0.3338
badparent | -0.12917 6.45 1 0.0111
delinquent1 | 0.15798 7.93 1 0.0049
south | -0.01537 0.03 1 0.8584
urate | -0.04046 0.27 1 0.6043
rural | 0.17582 7.12 1 0.0076
everkids | -0.05883 1.30 1 0.2539
AFQT | -0.15273 6.03 1 0.0141
_Iedcatrev_2| 0.00988 0.03 1 0.8561
_Iedcatrev_3| 0.15509 4.09 1 0.0433
relighome | -0.02351 0.17 1 0.6773
------------+---------------------------------------------------
global test | 55.46 69 0.8810
----------------------------------------------------------------
note: robust variance-covariance matrix used.
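Rather than splitting the sample, one way to let the everjail effect change
with age is to use the tvc() and texp() options of stcox. What follows is
only a minimal sketch: it assumes the data are already stset as above, the
covariate list is shortened for illustration, and the choice of _t as the
time function is an assumption rather than the exact specification behind
any results reported in this message.

```stata
#delimit cr
* Sketch only: allow the everjail hazard ratio to drift with analysis time.
* tvc(everjail) adds the interaction everjail x f(_t); texp(_t) sets f(_t) = _t.
* Covariate list shortened for illustration; data assumed already -stset-.
* schoenfeld()/scaledsch() are left out on purpose in this sketch.
stcox newern cumern jail everjail delinquent1 everkids AFQT  ///
    if race == 2 & varuse != . & sampid < 15 & sampid != 9 & truejail == 1,  ///
    tvc(everjail) texp(_t) robust
```

The main everjail term is then the hazard ratio at _t = 0, and the tvc term
describes how it changes per unit of analysis time, so the two have to be
read together.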
After poking around a bit, I discovered that I got very different
coefficients depending on the age of the respondents I was looking at
(recall that my analysis time is measured in terms of respondents' ages).
As an example, I split the panel roughly in half below. There is a
negative and statistically significant estimated effect of past
incarceration for younger observations:
#delimit;
capture drop sch*;
capture drop sca*;
xi: stcox newern cumern jail everjail alcoholicparent badparent
i.stateres
delinquent1 south urate rural everkids i.year AFQT i.edcatrev relighome
if race == 2 & varuse != . & sampid < 15 & sampid != 9 & truejail == 1
& agemon < 27,
robust schoenfeld(sch*) scaledsch(sca*);
stphtest, detail;
Cox regression -- Breslow method for ties
No. of subjects =     37925698                    Number of obs   =      1481
No. of failures =      6287322
Time at risk    =  224290651.6
                                                  Wald chi2(40)   =         .
Log pseudolikelihood =   -178.62156               Prob > chi2     =         .
                             (Std. Err. adjusted for 259 clusters in caseid)
------------------------------------------------------------------------------
             |               Robust
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      newern |    .993368   .0152892    -0.43   0.666     .9638493    1.023791
      cumern |   1.017118   .0108817     1.59   0.113     .9960119    1.038671
        jail |   .3177393   .1980827    -1.84   0.066     .0936313    1.078253
    everjail |   .2985212   .1799312    -2.01   0.045     .0916053     .972814
alcoholicp~t |   .8084521    .345783    -0.50   0.619     .3496125    1.869484
   badparent |   2.041492   1.142844     1.27   0.202     .6814562    6.115855
 delinquent1 |   .4215357   .1594842    -2.28   0.022     .2008122    .8848683
       south |    1.20106   1.361239     0.16   0.872     .1302696    11.07354
       urate |   1.015647    .085447     0.18   0.854     .8612531    1.197719
       rural |    .703451   .4031194    -0.61   0.539      .228794    2.162833
    everkids |   1.871234   .8486668     1.38   0.167     .7692722    4.551728
        AFQT |   1.000239   .0017253     0.14   0.890     .9968636    1.003627
_Iedcatrev_2 |   1.041943   .4474091     0.10   0.924     .4490959    2.417402
_Iedcatrev_3 |   1.291526   .7877315     0.42   0.675     .3907832    4.268454
   relighome |   2.116199   2.446491     0.65   0.517     .2195336    20.39914
------------------------------------------------------------------------------
. stphtest, detail;
Test of proportional hazards assumption
Time: Time
----------------------------------------------------------------
| rho chi2 df Prob>chi2
------------+---------------------------------------------------
newern | -0.10701 1.22 1 0.2688
cumern | 0.03699 0.23 1 0.6318
jail | -0.15441 4.74 1 0.0294
everjail | -0.07598 0.76 1 0.3833
alcoholicp~t| -0.04136 0.20 1 0.6568
badparent | 0.01651 0.08 1 0.7782
delinquent1 | 0.09358 1.40 1 0.2373
south | -0.14422 2.63 1 0.1046
urate | -0.26956 16.12 1 0.0001
rural | -0.04021 0.22 1 0.6358
everkids | -0.15353 7.05 1 0.0079
AFQT | -0.16024 6.05 1 0.0139
_Iedcatrev_2| 0.14036 5.18 1 0.0228
_Iedcatrev_3| 0.14445 3.39 1 0.0656
relighome | 0.16377 6.82 1 0.0090
------------+---------------------------------------------------
global test | 49.93 60 0.8197
----------------------------------------------------------------
note: robust variance-covariance matrix used.
But - and this is where I think something is wrong - there is a very large
*positive* and statistically significant coefficient for older
observations:
#delimit;
capture drop sch*;
capture drop sca*;
xi: stcox newern cumern jail everjail alcoholicparent badparent
i.stateres
delinquent1 south urate rural everkids i.year AFQT i.edcatrev relighome
if race == 2 & varuse != . & sampid < 15 & sampid != 9 & truejail == 1
& agemon >= 27,
robust schoenfeld(sch*) scaledsch(sca*);
stphtest, detail;
Cox regression -- Breslow method for ties
No. of subjects =     29698165                    Number of obs   =      1358
No. of failures =      4103864
Time at risk    =    199196738
                                                  Wald chi2(39)   =         .
Log pseudolikelihood =   -86.623654               Prob > chi2     =         .
                             (Std. Err. adjusted for 203 clusters in caseid)
------------------------------------------------------------------------------
             |               Robust
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      newern |    .967959   .0494768    -0.64   0.524     .8756854    1.069956
      cumern |   1.057655   .0343948     1.72   0.085      .992346    1.127262
        jail |    5.22185   3.696766     2.33   0.020     1.303836    20.91345
    everjail |   8.839575   6.892794     2.79   0.005     1.917317    40.75387
alcoholicp~t |   3.424662   2.921447     1.44   0.149     .6434142    18.22824
   badparent |   .6894594   .4592443    -0.56   0.577     .1868655    2.543831
 delinquent1 |   2.242574   1.290387     1.40   0.160     .7260416    6.926791
       south |   6.50e+10   9.17e+10    17.66   0.000     4.10e+09    1.03e+12
       urate |     1.1123   .1875627     0.63   0.528     .7992581    1.547949
       rural |   3.660132   3.326824     1.43   0.153     .6163246    21.73622
    everkids |   2.251156   1.416954     1.29   0.197     .6555872    7.730023
        AFQT |   .9979546   .0043069    -0.47   0.635     .9895488    1.006432
_Iedcatrev_2 |   .7028138   .5085406    -0.49   0.626     .1701883    2.902358
_Iedcatrev_3 |   .7220283   .8786676    -0.27   0.789     .0664799    7.841847
   relighome |   .2068565   .1644372    -1.98   0.047     .0435532    .9824666
------------------------------------------------------------------------------
. stphtest, detail;
Test of proportional hazards assumption
Time: Time
----------------------------------------------------------------
| rho chi2 df Prob>chi2
------------+---------------------------------------------------
newern | -0.02494 0.10 1 0.7494
cumern | 0.00947 0.03 1 0.8672
jail | 0.00265 0.00 1 0.9678
everjail | -0.01087 0.02 1 0.8807
alcoholicp~t| -0.07177 1.43 1 0.2322
badparent | -0.08120 1.64 1 0.2007
delinquent1 | -0.07520 1.21 1 0.2712
south | -0.15235 3.86 1 0.0494
urate | 0.12274 5.78 1 0.0162
rural | 0.08422 1.69 1 0.1933
everkids | -0.20486 11.32 1 0.0008
AFQT | 0.11702 5.16 1 0.0231
_Iedcatrev_2| 0.04897 0.85 1 0.3570
_Iedcatrev_3| 0.01366 0.07 1 0.7863
relighome | 0.02262 0.12 1 0.7268
------------+---------------------------------------------------
global test | 47.49 57 0.8112
----------------------------------------------------------------
note: robust variance-covariance matrix used.
This suggests to me that something isn't working as I'd expect for this
particular specification. By the way, I encounter roughly similar problems
if, rather than splitting up the sample, I use time-varying covariates, or
if I include a "goes to prison at some point during the panel" control
dummy variable rather than simply limiting the sample to this group. But
the problem ONLY occurs if I limit the sample to those who go to prison at
some point during the panel. If I look at results for older observations
in the sample as a whole, the parameter estimate on everjail is
correctly-signed. It may be that having gone to prison has no effect on
one's probability of marrying, but I find it very hard to believe that it
has a strongly *positive* effect for older people. Here's my theory, which
I'm hoping I can get some reaction to: both my dependent variable (the
hazard of first marriage) and my key independent variable (everjail) are
strongly positively correlated with analysis time (age). This is true for
obvious reasons for the dependent variable, and it is true for everjail
since the sample is limited to observations who go to prison at some point
during the panel - if you are in this subsample and haven't gone to jail
today, you're likely to do so next year, and if you don't do so next year,
you're certain to do so sometime after that, and so forth. So, since both
variables are positively correlated with analysis time, and since the
"older" sample is limited to people who are guaranteed to have some years
in which everjail == 1 (members of the younger sample may not go to prison
until they are older) and a certain percentage of whom are also going
to marry, is age confounding the relationship between everjail and the
hazard of marrying? It doesn't seem like this should be possible, since
age - my measure of analysis time - is explicitly being controlled for in
the baseline hazard.
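To be concrete about what I mean by age being controlled for, here is the
standard Cox partial likelihood as I understand it, ignoring ties and
weights. The notation is my own shorthand and not anything Stata prints:

```latex
L(\beta) \;=\; \prod_{i:\,\delta_i = 1}
  \frac{\exp\{x_i(t_i)'\beta\}}
       {\sum_{j \in R(t_i)} \exp\{x_j(t_i)'\beta\}}
```

Here t_i is the age at which person i marries, \delta_i = 1 marks an
observed first marriage, R(t_i) is the set of people still unmarried and
under observation at that age, and x_j(t_i) includes person j's value of
everjail at that age. The baseline hazard cancels out of each ratio, so
everjail is only ever compared across people of the same age. In the
restricted sample, though, the everjail == 0 members of a late risk set are
exactly the people who will go to prison still later, and there may be very
few of them.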
So, my basic question is this: if I'm using a Cox model and have an
independent variable that - like the dependent variable in a Cox analysis -
varies monotonically with analysis time, does that introduce some sort of
strange timing issue into the analysis? Should I expect to get odd
parameter estimates in a situation like this, or am I doubting my results
when in fact I shouldn't be? I'm stumped, so any and all advice would be
most welcome!!
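In case it helps anyone diagnose this, here is the kind of check I have in
mind. It is a sketch only, not something whose output I am reporting: it
assumes the data are already stset, "older" is a helper variable made up
for illustration, the cutoff of 27 is copied from the commands above, and
the counts are unweighted.

```stata
#delimit cr
* Sketch: in the ever-incarcerated subsample, cross-tabulate everjail against
* the st failure indicator _d (1 = first marriage) within each age group.
* older is a hypothetical helper variable; cutoff 27 copied from above.
generate byte older = (agemon >= 27) if !missing(agemon)
bysort older: tabulate everjail _d  ///
    if race == 2 & varuse != . & sampid < 15 & sampid != 9 & truejail == 1
```

If the everjail == 0 row for the older group contains only a handful of
marriages, then the large hazard ratio in that group would be resting on
very few comparisons.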
Cheers,
Adam Thomas
John F. Kennedy School of Government,
Harvard University