How do I convert my spell-type data into a survival dataset?
How do I stset my spell-type data?
Title: stsetting spell-type data
Author: Mario Cleves, StataCorp
Date: November 1999
It is strongly recommended that before reading this FAQ you become familiar
with the terms and definitions presented in the stset entry of the Stata
manual.
Spell or duration data arise frequently from studies in econometrics and
other disciplines. In a typical spell dataset there are multiple
observations for each subject, each covering a span of time (a spell) during
which the subject is in a given state, such as employed or
unemployed. The main difference between this type of data and that required
to perform survival analysis is that the latter expects the event of
interest, commonly known as the failure event, to occur at the end of the
time spanned by the record. Survival analysis is concerned only with the
time at which the transition from one state to another occurs.
An often overlooked issue is that the failure event must be clearly defined.
If we have employment history data recording spells during which an
individual is either employed or unemployed, then we need to clearly define
the event of interest (the failure event) as either entering employment or
entering unemployment. This need brings up another crucial point. If we
define our event as transition from unemployment to employment, then a
subject is at “risk” of the transition only during those times
when the subject is unemployed. Consequently, during time spans when the
subject is employed the subject is not at risk of “failure”, and
this in fact becomes a time gap in the data. Even though you may not have
time gaps in your spell dataset, the resulting survival dataset will
probably contain gaps.
In survival data, therefore, each observation must cover a span of time at
the end of which the event of interest either occurs or does not. This model
requires the subject be at risk of the event (transition from
unemployment to employment) during the time span.
Assume we have employment history data recording spells during which an
individual is either employed or unemployed. Further, assume we are
interested in the transition from unemployed to employed. That is, the
“failure event” is becoming employed. Who is at risk of making
the transition? Of course, only unemployed individuals. Our spell data look
like this:
ID    Spelltyp       Begin   End
101   Employed           1    72
102   School-unemp      10    20
102   Employed          20    35
102   Unemployed        35    40
103   School-unemp       0    20
103   Welfare-unem      20    30
103   Employed          30    60
We have data on three individuals identified by the ID variable (ID=101,
ID=102 and ID=103). We will use these data to create the corresponding
survival dataset and then stset it.
The first person (ID=101) is already employed at entry and, consequently,
not at risk of entering employment. Thus he should either be excluded from
the study or be included as
ID    Begin   End   Employed
101       0     1          1
meaning he was unemployed from time 0 to time 1 and entered employment
at time 1. We will assume this inference is correct.
For ID=102, we are given three records:
ID    Spelltyp       Begin   End
102   School-unemp      10    20
102   Employed          20    35
102   Unemployed        35    40
This subject was not under observation from time 0 to time 10, which is what
we refer to as left truncation or delayed entry. The first observation
indicates that the person was unemployed (in fact, in school) from time 10
to time 20 and entered employment at time 20. He was employed from time 20
to time 35, when he became unemployed. While employed, he was not at risk of
the transition; therefore, this record is noninformative and should be left
out. This results in what Stata calls a time gap. He then entered
unemployment at time 35 and remained unemployed until time 40. This last
observation is censored because he was still unemployed at the end of this
time span. Thus the corresponding survival data are
ID    Begin   End   Employed
102      10    20          1
102      35    40          0
For ID=103, we are also given three records:
ID    Spelltyp       Begin   End
103   School-unemp       0    20
103   Welfare-unem      20    30
103   Employed          30    60
This individual was unemployed from time 0 to time 30, when he became
employed. He then remained employed until the end of the follow-up period.
Although he was in school from time 0 to time 20 and on welfare from time
20 to time 30, there was only one transition from unemployment to
employment. Consequently, for this individual there is only one relevant
record:
ID    Begin   End   Employed
103       0    30          1
The person was unemployed from time 0 to time 30 and entered employment at
time 30.
We could also create two records for this subject, one per unemployment
spell. We may need to do this if we have other covariates that are time
varying. We will assume we do have time-varying covariates and adopt the
following setup:
ID    Begin   End   Employed
103       0    20          0
103      20    30          1
Combining all the above observations, we obtain the following survival
dataset, which we then stset:
ID    Begin   End   Employed
101       0     1          1
102      10    20          1
102      35    40          0
103       0    20          0
103      20    30          1
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
                id:  ID
     failure event:  Employed != 0 & Employed != .
obs. time interval:  (Begin, End]
 exit on or before:  time .
------------------------------------------------------------------------------
        5  total obs.
        0  exclusions
------------------------------------------------------------------------------
        5  obs. remaining, representing
        3  subjects
        3  failures in single failure-per-subject data
        1  subject remains at risk after failure
       46  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        40
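The record-by-record reasoning used to build the survival dataset can also be
written as a short do-file. What follows is only a minimal sketch under stated
assumptions, not the method used to produce the listing above: the helper
variables emp, first_emp, and added are hypothetical names, each unemployment
spell is kept as its own record (as was done for ID=103), and a subject first
seen in employment is assumed to have been unemployed from time 0 until that
spell began (the inference made for ID=101).

```stata
* Enter the spell data from the table in the text
clear
input ID str12 Spelltyp Begin End
101 "Employed"       1 72
102 "School-unemp"  10 20
102 "Employed"      20 35
102 "Unemployed"    35 40
103 "School-unemp"   0 20
103 "Welfare-unem"  20 30
103 "Employed"      30 60
end

* Flag spells spent in employment (not at risk of entering employment)
generate byte emp = (Spelltyp == "Employed")

* An at-risk spell ends in failure if the subject's next spell is employment;
* the last spell of a subject has no next spell and is therefore censored
bysort ID (Begin): generate byte Employed = (emp[_n+1] == 1) if !emp

* Assumption: a subject first observed in employment after time 0 was
* unemployed from time 0 until that spell began (the ID=101 case)
bysort ID (Begin): generate byte first_emp = (_n == 1 & emp & Begin > 0)
expand 2 if first_emp, generate(added)
replace End      = Begin if added
replace Begin    = 0     if added
replace Employed = 1     if added
replace emp      = 0     if added

* Employment spells are noninformative for this transition; dropping them
* is what creates the time gaps
drop if emp
keep ID Begin End Employed
sort ID Begin
list, sepby(ID) noobs

stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
```

Whether the time-0 assumption is reasonable depends on the study design; if it
is not, such subjects would instead be dropped before the stset step.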
Although our example data have only one failure per subject, spell data
most likely will contain multiple failures per subject resulting from
individuals moving in and out of employment. If this is the case, then you
will probably benefit from reading the FAQ Analysis of multiple failure-time
data or the article by the same name (Cleves 1999).
Regardless of whether you have single- or multiple-failure-per-subject data,
the logic used to create the survival dataset is as described. The only
difference arises when you stset and then analyze the data.
Continuing with our example, we can now describe our data:
. stdes
         failure _d:  Employed
   analysis time _t:  End
  exit on or before:  time .
                 id:  ID
                                   |-------------- per subject --------------|
Category                   total        mean         min     median        max
------------------------------------------------------------------------------
no. of subjects                3
no. of records                 5    1.666667           1          2          2
(first) entry time                  3.333333           0          0         10
(final) exit time                   23.66667           1         30         40
subjects with gap              1
time on gap if gap            15          15          15         15         15
time at risk                  46    15.33333           1         15         30
failures                       3           1           1          1          1
------------------------------------------------------------------------------
stdes correctly reports that we have five observations for three subjects,
that one subject has a gap lasting 15 time units (ID=102 from 20 to 35), and
that there are three failures in the data (i.e., three transitions from
unemployment to employment). Although the original dataset did not contain
time gaps, the survival dataset does because of time spans during which the
subjects are not at risk of the transition. This is not unusual when
transforming spell data into survival-time data. Having verified our data,
we are now ready to continue our data analysis using other st commands.
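The gap that stdes reports can also be located record by record. The sketch
below is an illustration only: it re-enters the five survival records, lists
the variables that stset creates (_t0, _t, _d, _st), and computes the distance
between the end of one record and the start of the next within each subject.
The variable name gap is a hypothetical choice.

```stata
* Survival dataset from the text
clear
input ID Begin End Employed
101  0  1 1
102 10 20 1
102 35 40 0
103  0 20 0
103 20 30 1
end

quietly stset End, failure(Employed) time0(Begin) id(ID) exit(time .)

* Entry time, exit time, failure indicator, and inclusion flag per record
list ID _t0 _t _d _st, sepby(ID) noobs

* Time between the end of the previous record and the start of this one
bysort ID (_t0): generate gap = _t0 - _t[_n-1] if _n > 1

* Records that follow a period during which the subject was not at risk
list ID _t0 _t gap if gap > 0 & !missing(gap), noobs
```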
We can also set up a survival dataset corresponding to transitions from
employment to unemployment by following a similar strategy.
stsetting the data
There are several ways to stset our data. The above dataset was stset in one
of these possible ways. The proper stset syntax for the data, however,
depends on the study design and assumptions. In what follows we provide
guidance for selecting the appropriate stset command syntax. This is only a
guide, and idiosyncrasies in your particular data may require more
modifications or options.
There are two main questions that need to be answered to stset our
data.
Question 1: When does the clock begin ticking?
If you want the “clock” to begin at time zero, then what we did
above is correct. For calendar data stored as Stata dates, t=0 corresponds
to 1/1/1960, but for the above data, t=0 is simply time 0. The command we
used was
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
If we want the “clock” to start ticking for each individual when
the subject first enters unemployment, 10 for ID==102 and 0 for the
others, then we need to specify origin().
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)
Stata will use as the time origin the earliest entry time per subject.
When origin() is not specified, Stata automatically sets the origin
to zero and treats records with entry times greater than zero as
left-truncated or delayed-entry observations. That is what we obtained with
our original syntax.
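One way to see what origin() changes is to stset the data both ways and
compare the analysis-time variables. The following is a hedged sketch: it
saves _t0 and _t from the default setup under the hypothetical names
t0_default and t_default, stsets again with origin(Begin), and lists both
versions side by side. With the default origin of zero, analysis time equals
recorded time; the values under origin(Begin) are left for you to inspect in
your own session rather than asserted here.

```stata
* Survival dataset from the text
clear
input ID Begin End Employed
101  0  1 1
102 10 20 1
102 35 40 0
103  0 20 0
103 20 30 1
end

* Default origin (zero): analysis time equals recorded time
quietly stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
generate t0_default = _t0
generate t_default  = _t

* Origin taken from Begin (per the text, the earliest entry time per subject)
quietly stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)

* Compare entry and exit times under the two setups
list ID Begin End t0_default t_default _t0 _t, sepby(ID) noobs
```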
Question 2: How do we want to handle each subject’s second,
third, etc., observations?
If we want the clock to continue ticking for each individual from the first
observation forward, then we can use the syntax we used in our example
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
or, depending on the answer to question 1,
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)
If, on the other hand, we want to reset the clock to zero or the
origin() for every observation, then we stset the data without
specifying id(). The ID variable can be used later in the
analysis to cluster the data and to produce a robust standard error.
. stset End, failure(Employed) time0(Begin) exit(time .)
or, depending on the answer to question 1,
. stset End, failure(Employed) time0(Begin) exit(time .) origin(Begin)
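When id() is omitted, the ID variable can still enter the analysis as a
cluster variable. The sketch below shows one way this might look; the
constant-only exponential model fit by streg is chosen purely for
illustration and is not taken from the text, and a real analysis would
include covariates and a model suited to the study.

```stata
* Survival dataset from the text
clear
input ID Begin End Employed
101  0  1 1
102 10 20 1
102 35 40 0
103  0 20 0
103 20 30 1
end

* No id(): each record is treated as a separate subject
stset End, failure(Employed) time0(Begin) exit(time .)

* Illustrative model only; vce(cluster ID) requests standard errors that
* allow for correlation among records belonging to the same person
streg, distribution(exponential) vce(cluster ID)
```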
To summarize,
- If we want time to start at 0 and continue ticking for subsequent
observations, we use
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
- If we want time to start at 0 and to be reset to zero for every
observation, then we do not specify id().
. stset End, failure(Employed) time0(Begin) exit(time .)
- If we want time to start at the first entry time for each subject
and continue ticking for subsequent observations, then we specify
origin().
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)
- If we want to reset the clock to begin at the entry time of each
record, then we specify origin(), but not id().
. stset End, failure(Employed) time0(Begin) exit(time .) origin(Begin)
Reference
Cleves, M. 1999. ssa13: Analysis of multiple failure-time data with Stata.
Stata Technical Bulletin 49: 30–39.