

From: Thomas Pepinsky <pepinsky@cornell.edu>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: st: the id() option in -stset- and "gap-time" conditional risk models
Date: Mon, 8 Mar 2010 20:36:20 -0500

A colleague of mine and I are trying to figure out how to estimate a "gap-time" conditional risk set model in Stata. We are having trouble reconciling several Stata recommendations that seem contradictory to us. Specifically, we are not sure whether we need to specify the id() option when we -stset- the data.

We are using replication data from Box-Steffensmeier and Zorn (2002). Their paper is available here: http://bit.ly/aDLVdZ. The replication data are available here: http://bit.ly/9jAIo4, using the file ag_pwp.dta. We are using Stata 10.

We wish to estimate the effect of democracy on international disputes between pairs of countries. Each subject is a "dyad," which is a pair of countries. Democracy is a time-varying covariate. Disputes are the failure events. Dyads experience multiple failures (i.e., multiple disputes). Here is a snapshot of the data structure:

dyadid  start  stop  starta  stopa  futime  dispute  sumdisp  democ
  2020      0     1       0      1      35        0        0      1
  2020      1     2       1      2      35        0        0      1
  2020      2     3       2      3      35        0        0      1
  2020      3     4       3      4      35        0        0      1
   . . .
  2020     21    22      21     22      35        0        0      1
  2020     22    23      22     23      35        0        0      1
  2020     23    24      23     24      35        1        1      1
  2020      0     1      24     25      35        0        1      1
  2020      1     2      25     26      35        0        1      1
  2020      2     3      26     27      35        0        1      1
  2020      3     4      27     28      35        1        2      1
  2020      0     1      28     29      35        0        2      1
   . . .
  2041      0     1       0      1      25        0        0    -.8
  2041      1     2       1      2      25        0        0    -.9
  2041      2     3       2      3      25        1        1    -.9
  2041      0     1       3      4      25        0        1    -.9
  2041      1     2       4      5      25        0        1    -.9
  2041      2     3       5      6      25        0        1    -.9
  2041      3     4       6      7      25        0        1    -.9

DYADID indexes subjects. STOP and STOPA are analysis-time variables that differ based on whether we are counting from entry into the pool for an "elapsed time" model (STOPA) or from the last failure for the "gap" model (STOP). DISPUTE marks a dispute between the two states, which is the failure event. START and STARTA mark when the subject comes under observation, differing in the same way as STOP and STOPA. FUTIME marks the latest time at which the subject is both under observation and at risk, because we have multiple failure data. SUMDISP is the cumulative number of disputes that have occurred. DEMOC is democracy, our time-varying covariate, defined as the average of the levels of democracy in the two countries in the dyad.

Our confusion arises from what we believe are two contradictory pieces of advice on how to set up the data for analysis using -stset-. On the one hand, the stset help file (http://www.stata.com/help.cgi?stset) indicates that "Specifying id() never hurts", which we interpret to mean that we should be sure to specify the id() option when -stset-ing our data. If we do that, we get the following output:

. stset stop, fail(dispute) exit(futime) enter(start) id(dyadid)

                id:  dyadid
     failure event:  dispute != 0 & dispute < .
obs. time interval:  (stop[_n-1], stop]
 enter on or after:  time start
 exit on or before:  time futime

------------------------------------------------------------------------------
    20448  total obs.
     2621  multiple records at same instant                     PROBABLE ERROR
           (stop[_n-1]==stop)
------------------------------------------------------------------------------
    17827  obs. remaining, representing
      816  subjects
      111  failures in multiple failure-per-subject data
    18471  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        35

On the other hand, the FAQ on multiple failure-time data (http://www.stata.com/support/faqs/stat/stmfail.html) does NOT include the id() option in its example of how to estimate the conditional gap model. If we follow the -stset- procedures outlined there, we get very different output:

. stset stop, fail(dispute) exit(futime) enter(start)

     failure event:  dispute != 0 & dispute < .
obs. time interval:  (0, stop]
 enter on or after:  time start
 exit on or before:  time futime

------------------------------------------------------------------------------
    20448  total obs.
        0  exclusions
------------------------------------------------------------------------------
    20448  obs. remaining, representing
      405  failures in single record/single failure data
    20448  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        35

Without the id() option, Stata considers the data to be single record/single failure data; with the id() option, Stata considers the data to be multiple failure-per-subject data. These distinctions matter for our substantive conclusions when estimating the regression model. Compare the following two outputs, estimated using the recommended syntax for gap-time conditional risk set models from the FAQ:

. stset stop, fail(dispute) exit(futime) enter(start) id(dyadid)
***output omitted***

. stcox democ, nohr robust cluster(dyadid) strata(sumdisp) efron

         failure _d:  dispute
   analysis time _t:  stop
  enter on or after:  time start
  exit on or before:  time futime
                 id:  dyadid

Iteration 0:   log pseudolikelihood =   -304.8235
Iteration 1:   log pseudolikelihood =  -304.41482
Iteration 2:   log pseudolikelihood =  -304.40997
Iteration 3:   log pseudolikelihood =  -304.40997
Refining estimates:
Iteration 0:   log pseudolikelihood =  -304.40997

Stratified Cox regr. -- Efron method for ties

No. of subjects      =          816             Number of obs    =       17827
No. of failures      =          111
Time at risk         =        18471
                                                Wald chi2(1)     =        0.51
Log pseudolikelihood =   -304.40997             Prob > chi2      =      0.4770

                               (Std. Err. adjusted for 816 clusters in dyadid)
------------------------------------------------------------------------------
             |               Robust
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       democ |   .1889203   .2656566     0.71   0.477    -.3317571    .7095977
------------------------------------------------------------------------------
                                                         Stratified by sumdisp

. stset stop, fail(dispute) exit(futime) enter(start)
***output omitted***

. stcox democ, nohr robust cluster(dyadid) strata(sumdisp) efron

         failure _d:  dispute
   analysis time _t:  stop
  enter on or after:  time start
  exit on or before:  time futime

Iteration 0:   log pseudolikelihood =  -1567.2597
Iteration 1:   log pseudolikelihood =  -1567.2407
Iteration 2:   log pseudolikelihood =  -1567.2407
Refining estimates:
Iteration 0:   log pseudolikelihood =  -1567.2407

Stratified Cox regr. -- Efron method for ties

No. of subjects      =        20448             Number of obs    =       20448
No. of failures      =          405
Time at risk         =        20448
                                                Wald chi2(1)     =        0.07
Log pseudolikelihood =   -1567.2407             Prob > chi2      =      0.7926

                               (Std. Err. adjusted for 827 clusters in dyadid)
------------------------------------------------------------------------------
             |               Robust
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       democ |   .0199128   .0757423     0.26   0.793    -.1285394    .1683651
------------------------------------------------------------------------------
                                                         Stratified by sumdisp

We note that the way the Stata FAQ says to estimate this model (without the id() option) is the way that we have seen it done in other applications, but two issues give us pause. First, when we -stset- the data without the id() option, Stata treats the data as single record/single failure data, which is not the case for us. We have time-varying covariates, so we must have multiple failure-per-subject data. Second, this contradicts the advice that "Specifying id() never hurts". In this case, it is clear that specifying id() might actually hurt!

Any advice on how to proceed would be most appreciated.

TP
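One way to see where the "multiple records at same instant" message may come from is to check whether the same STOP value recurs within a dyad. In the snapshot, STOP restarts at 1 after each dispute, so dyad 2020 has more than one record with stop == 1. The lines below are only a diagnostic sketch under stated assumptions: ag_pwp.dta is assumed to be in the working directory, the variable names are assumed to match the snapshot, and samestop is a hypothetical helper variable that is not part of the replication data.

```stata
* Assumption: ag_pwp.dta is in the current working directory
use ag_pwp, clear

* Flag records that share a stop time with another record of the same dyad
* (samestop is a hypothetical helper variable, not part of the data)
bysort dyadid stop: generate byte samestop = _N > 1

* Count the records that belong to such repeated (dyadid, stop) groups
count if samestop

* Inspect one dyad from the snapshot to see the gap-time clock restart
list dyadid start stop starta stopa dispute sumdisp if dyadid == 2020, sepby(sumdisp)
```

The count includes every record in a repeated group, so it may not match the 2621 records that -stset- reports; the sketch only shows that STOP is not unique within DYADID in gap-time form.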
