



st: the id() option in -stset- and "gap-time" conditional risk models


From   Thomas Pepinsky <pepinsky@cornell.edu>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   st: the id() option in -stset- and "gap-time" conditional risk models
Date   Mon, 8 Mar 2010 20:36:20 -0500

A colleague and I are trying to figure out how to estimate a "gap-time" conditional risk set model in Stata. We are having trouble reconciling several Stata recommendations that seem contradictory to us. Specifically, we are not sure whether we need to specify the id() option when we -stset- the data.

We are using replication data from Box-Steffensmeier and Zorn (2002). Their paper is available here: http://bit.ly/aDLVdZ. The replication data are available here: http://bit.ly/9jAIo4, using the file ag_pwp.dta. We are using Stata 10.

We wish to estimate the effect of democracy on international disputes between pairs of countries. Each subject is a "dyad," which is a pair of countries. Democracy is a time-varying covariate. Disputes are the failure events. Dyads experience multiple failures (i.e. multiple disputes).

Here is a snapshot of the data structure:

dyadid  start  stop  starta  stopa  futime  dispute  sumdisp  democ
  2020      0     1       0      1      35        0        0      1
  2020      1     2       1      2      35        0        0      1
  2020      2     3       2      3      35        0        0      1
  2020      3     4       3      4      35        0        0      1
.
.
.
  2020     21    22      21     22      35        0        0      1
  2020     22    23      22     23      35        0        0      1
  2020     23    24      23     24      35        1        1      1
  2020      0     1      24     25      35        0        1      1
  2020      1     2      25     26      35        0        1      1
  2020      2     3      26     27      35        0        1      1
  2020      3     4      27     28      35        1        2      1
  2020      0     1      28     29      35        0        2      1
.
.
.
  2041      0     1       0      1      25        0        0    -.8
  2041      1     2       1      2      25        0        0    -.9
  2041      2     3       2      3      25        1        1    -.9
  2041      0     1       3      4      25        0        1    -.9
  2041      1     2       4      5      25        0        1    -.9
  2041      2     3       5      6      25        0        1    -.9
  2041      3     4       6      7      25        0        1    -.9


DYADID indexes subjects. STOP and STOPA are the analysis-time variables; they differ in whether we count from entry into the pool, for an "elapsed time" model (STOPA), or from the last failure, for the "gap" model (STOP). DISPUTE marks a dispute between the two states, which is the failure event. START and STARTA mark when the subject comes under observation, and they differ in the same way as STOP and STOPA. FUTIME marks the latest time at which the subject is both under observation and at risk; we need it because we have multiple failure data. SUMDISP is the running count of disputes that have occurred so far for the dyad. DEMOC is democracy, our time-varying covariate, defined as the average of the levels of democracy in the two countries in the dyad.
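To make the relationship between the two clocks concrete, here is a minimal sketch (in Python rather than Stata, purely as an illustration) of how the gap-time START/STOP values in the snapshot can be recovered from the elapsed-time STARTA/STOPA values and DISPUTE. The function name gap_time_clock is hypothetical, and the rule that the clock resets at the STOPA of each dispute record is our reading of the snapshot, not something taken from the replication files.

```python
def gap_time_clock(rows):
    """Convert elapsed-time intervals to gap-time intervals for one dyad.

    rows: list of (starta, stopa, dispute) tuples, sorted by starta.
    Returns a list of (start, stop) tuples measured from the last dispute.
    """
    out = []
    origin = rows[0][0]  # elapsed time at which the current spell began
    for starta, stopa, dispute in rows:
        out.append((starta - origin, stopa - origin))
        if dispute:
            # Assumption: the gap-time clock resets right after a dispute.
            origin = stopa
    return out


# The seven records shown for dyad 2041 in the snapshot above.
dyad_2041 = [(0, 1, 0), (1, 2, 0), (2, 3, 1), (3, 4, 0),
             (4, 5, 0), (5, 6, 0), (6, 7, 0)]
gap = gap_time_clock(dyad_2041)
print(gap)
```

For dyad 2041 this reproduces the START/STOP columns of the snapshot: the clock runs 0 to 3, resets after the dispute, and then runs 0 to 4.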

Our confusion arises from what we believe are two contradictory pieces of advice on how to set up the data for analysis using -stset-. 

On the one hand, the stset help file (http://www.stata.com/help.cgi?stset) states that "Specifying id() never hurts", which we interpret to mean that we should always specify the id() option when -stset-ing our data. If we do that, we get the following output:

. stset stop, fail(dispute) exit(futime) enter(start) id(dyadid)

                id:  dyadid
     failure event:  dispute != 0 & dispute < .
obs. time interval:  (stop[_n-1], stop]
 enter on or after:  time start
 exit on or before:  time futime

------------------------------------------------------------------------------
    20448  total obs.
     2621  multiple records at same instant                     PROBABLE ERROR
           (stop[_n-1]==stop)
------------------------------------------------------------------------------
    17827  obs. remaining, representing
      816  subjects
      111  failures in multiple failure-per-subject data
    18471  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        35
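Our guess at why records are flagged: the log reports the condition stop[_n-1]==stop, and in gap time the clock restarts after each dispute, so the same STOP value recurs within a dyad. The sketch below (Python, for illustration only) counts such repeats for the dyad 2041 records in the snapshot. The function name count_same_instant is hypothetical, and sorting by subject and time before comparing is an assumption about what -stset- does internally.

```python
from collections import defaultdict


def count_same_instant(records):
    """Count records whose stop time equals that of the preceding record
    for the same subject, after sorting each subject's records by stop.

    records: iterable of (subject_id, stop) pairs.
    """
    by_id = defaultdict(list)
    for subject_id, stop in records:
        by_id[subject_id].append(stop)

    flagged = 0
    for stops in by_id.values():
        stops.sort()
        # Compare each record with its predecessor within the subject.
        flagged += sum(1 for prev, cur in zip(stops, stops[1:]) if prev == cur)
    return flagged


# Gap-time STOP values for the seven dyad 2041 records in the snapshot.
dyad_2041 = [(2041, s) for s in (1, 2, 3, 1, 2, 3, 4)]
flagged = count_same_instant(dyad_2041)
print(flagged)
```

Under this reading, three of the seven records shown for dyad 2041 would collide with an earlier record at the same STOP value, which is the kind of exclusion the log reports.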



On the other hand, the FAQ on multiple failure-time data (http://www.stata.com/support/faqs/stat/stmfail.html) does NOT include the id() option in its example of how to estimate the conditional gap model. If we follow the -stset- procedure outlined there, we get very different output:


. stset stop, fail(dispute) exit(futime) enter(start)

     failure event:  dispute != 0 & dispute < .
obs. time interval:  (0, stop]
 enter on or after:  time start
 exit on or before:  time futime

------------------------------------------------------------------------------
    20448  total obs.
        0  exclusions
------------------------------------------------------------------------------
    20448  obs. remaining, representing
      405  failures in single record/single failure data
    20448  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        35
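One way to read this log: without id(), every record is treated as its own subject observed over (0, stop], entering at START and exiting no later than FUTIME. The sketch below (Python, illustration only) computes the time at risk per record under that reading for the dyad 2041 records in the snapshot. The function name time_at_risk and the exact entry/exit arithmetic are our assumptions, not taken from the Stata documentation.

```python
def time_at_risk(start, stop, futime, t0=0):
    """Time at risk for one record treated as its own subject.

    Assumes the record is observed over (t0, stop], enters at `start`,
    and exits no later than `futime`.
    """
    entry = max(t0, start)
    exit_time = min(stop, futime)
    return max(0, exit_time - entry)


# Gap-time (start, stop) pairs for dyad 2041; futime is 25 for this dyad.
rows_2041 = [(0, 1), (1, 2), (2, 3), (0, 1), (1, 2), (2, 3), (3, 4)]
total = sum(time_at_risk(s, e, futime=25) for s, e in rows_2041)
print(total)
```

Each record in the snapshot spans one time unit, so each contributes one unit at risk. That is consistent with the log reporting 20448 units of analysis time for 20448 records.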


Without the id() option, Stata considers the data to be single record/single failure data; with the id() option, Stata considers the data to be multiple failure-per-subject data. This distinction matters for our substantive conclusions when we estimate the regression model. Compare the following two sets of output, both estimated using the syntax the FAQ recommends for gap-time conditional risk set models:

. stset stop, fail(dispute) exit(futime) enter(start) id(dyadid)

***output omitted***

. stcox democ, nohr robust cluster(dyadid) strata(sumdisp) efron

         failure _d:  dispute
   analysis time _t:  stop
  enter on or after:  time start
  exit on or before:  time futime
                 id:  dyadid

Iteration 0:   log pseudolikelihood =  -304.8235
Iteration 1:   log pseudolikelihood = -304.41482
Iteration 2:   log pseudolikelihood = -304.40997
Iteration 3:   log pseudolikelihood = -304.40997
Refining estimates:
Iteration 0:   log pseudolikelihood = -304.40997

Stratified Cox regr. -- Efron method for ties

No. of subjects      =          816                Number of obs   =     17827
No. of failures      =          111
Time at risk         =        18471
                                                   Wald chi2(1)    =      0.51
Log pseudolikelihood =   -304.40997                Prob > chi2     =    0.4770

                               (Std. Err. adjusted for 816 clusters in dyadid)
------------------------------------------------------------------------------
             |               Robust
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       democ |   .1889203   .2656566     0.71   0.477    -.3317571    .7095977
------------------------------------------------------------------------------
                                                         Stratified by sumdisp



. stset stop, fail(dispute) exit(futime) enter(start)

***output omitted***

. stcox democ, nohr robust cluster(dyadid) strata(sumdisp) efron

         failure _d:  dispute
   analysis time _t:  stop
  enter on or after:  time start
  exit on or before:  time futime

Iteration 0:   log pseudolikelihood = -1567.2597
Iteration 1:   log pseudolikelihood = -1567.2407
Iteration 2:   log pseudolikelihood = -1567.2407
Refining estimates:
Iteration 0:   log pseudolikelihood = -1567.2407

Stratified Cox regr. -- Efron method for ties

No. of subjects      =        20448                Number of obs   =     20448
No. of failures      =          405
Time at risk         =        20448
                                                   Wald chi2(1)    =      0.07
Log pseudolikelihood =   -1567.2407                Prob > chi2     =    0.7926

                               (Std. Err. adjusted for 827 clusters in dyadid)
------------------------------------------------------------------------------
             |               Robust
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       democ |   .0199128   .0757423     0.26   0.793    -.1285394    .1683651
------------------------------------------------------------------------------
                                                         Stratified by sumdisp


We note that the approach in the Stata FAQ (without the id() option) is the one we have seen in other applications, but two issues give us pause. First, when we -stset- the data without the id() option, Stata treats the data as single record/single failure data, which is not the case for us: we have time-varying covariates and repeated disputes, so we have multiple records and multiple failures per subject. Second, the large difference between the two sets of results contradicts the advice that "Specifying id() never hurts". In this case, it seems that specifying id() might actually hurt!

Any advice on how to proceed would be most appreciated.

TP
