How do I convert my spell-type data into a survival dataset?

How do I stset my spell-type data?

Title   stsetting spell-type data
Author Mario Cleves, StataCorp

It is strongly recommended that before reading this FAQ you become familiar with the terms and definitions presented in the stset entry of the Stata manual.

Spell or duration data arise frequently in econometrics and other disciplines. In a typical spell dataset there are multiple observations for each subject, each covering a span of time (a spell) during which the subject is in a given state, such as employed or unemployed. The main difference between this type of data and that required for survival analysis is that the latter expects the event of interest, commonly known as the failure event, to occur at the end of the time spanned by the record. Survival analysis is concerned only with the time at which the transition from one state to another occurs.

An often overlooked issue is that the failure event must be clearly defined. If we have employment history data recording spells during which an individual is either employed or unemployed, then we need to clearly define the event of interest (the failure event) as either entering employment or entering unemployment. This need brings up another crucial point. If we define our event as the transition from unemployment to employment, then a subject is at “risk” of the transition only during those times when the subject is unemployed. Consequently, during time spans when the subject is employed, the subject is not at risk of “failure”, and each such span becomes a time gap in the data. Even though you may not have time gaps in your spell dataset, the resulting survival dataset will probably contain gaps.

In survival data, therefore, each observation must cover a span of time at the end of which the event of interest either occurs or does not. This setup requires that the subject be at risk of the event (here, the transition from unemployment to employment) throughout the time span.

Assume we have employment history data recording spells during which an individual is either employed or unemployed. Further, assume we are interested in the transition from unemployed to employed. That is, the “failure event” is becoming employed. Who is at risk of making the transition? Of course, only unemployed individuals. Our spell data look like this:

   ID                 Spelltyp         Begin     End
   101                Employed             1      72
   102                School-unemp        10      20
   102                Employed            20      35
   102                Unemployed          35      40
   103                School-unemp         0      20
   103                Welfare-unem        20      30
   103                Employed            30      60

We have data on three individuals identified by the ID variable (ID=101, ID=102 and ID=103). We will use these data to create the corresponding survival dataset and then stset it.

The first person (ID=101) is already employed at entry and, consequently, not at risk of entering employment. Thus he should either be excluded from the study or be included as

   ID       Begin    End         Employed
   101          0      1                1

meaning he was unemployed from time 0 to time 1 and entered employment at time 1. We will assume this inference is correct.

For ID=102, we are given three records

   ID                 Spelltyp         Begin     End
   102                School-unemp        10      20
   102                Employed            20      35
   102                Unemployed          35      40

This subject was not under observation from time 0 to time 10, which is what we refer to as left truncation or delayed entry. The first observation indicates that the person was unemployed from time 10 to 20 and entered employment at time 20. In fact, this person was in school during this time and then entered employment at time 20. He was employed from time 20 to time 35, when he became unemployed. During that time, he was not at risk of the transition. He was already employed; therefore, this record is noninformative and should be left out. This results in what we call in Stata a time gap. He then entered unemployment at time 35 and remained unemployed until time 40. This last observation is censored, because he was still unemployed at the end of this time span. Thus the corresponding survival data are

   ID       Begin    End        Employed
   102         10     20               1
   102         35     40               0

For ID=103, we are also given three records:

   ID                 Spelltyp         Begin     End
   103                School-unemp         0      20
   103                Welfare-unem        20      30
   103                Employed            30      60

This individual was unemployed from time 0 to time 30 when he became employed. He then remained employed until the end of the follow-up period. Although he was in school from time 0 to time 20, and on welfare from time 20 to time 30, there was only one transition from unemployment to employment. Consequently, for this individual there is only one important record.

   ID       Begin    End         Employed
   103          0     30                1

The person was unemployed from time 0 to time 30 and entered employment at time 30.

We could also create two records for this subject. We may need to do this if we have other covariates that are time varying. We will assume we do have time-varying covariates and adopt the following setup:

   ID       Begin    End        Employed
   103          0     20               0
   103         20     30               1

Combining all the above observations, our survival dataset and the corresponding stset command produce

   ID       Begin    End        Employed
   101          0      1               1
   102         10     20               1
   102         35     40               0
   103          0     20               0
   103         20     30               1
 . stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
   
                 id:  ID
      failure event:  Employed != 0 & Employed != .
 obs. time interval:  (Begin, End]
  exit on or before:  time .
 
 -----------------------------------------------------------------------------
         5  total obs.
         0  exclusions
 -----------------------------------------------------------------------------
         5  obs. remaining, representing
         3  subjects
         3  failures in single failure-per-subject data
         1  subject remains at risk after failure
        46  total analysis time at risk, at risk from t =         0
                              earliest observed entry t =         0
                                   last observed exit t =        40
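
The record-by-record reasoning above can also be automated. The following is only a minimal sketch, assuming the spell data are entered exactly as in the first table, that every state other than "Employed" counts as unemployment, and that a subject first seen in employment was unemployed from time 0 until that spell began (the same inference made for ID=101). The variable names atrisk and imputed are ours and purely illustrative.

```stata
clear
input ID str12 Spelltyp Begin End
101 "Employed"      1 72
102 "School-unemp" 10 20
102 "Employed"     20 35
102 "Unemployed"   35 40
103 "School-unemp"  0 20
103 "Welfare-unem" 20 30
103 "Employed"     30 60
end

* Flag spells during which the subject is at risk (any non-employed state)
generate byte atrisk = (Spelltyp != "Employed")

* A spell ends in failure if the subject's next spell is employment
bysort ID (Begin): generate byte Employed = atrisk & (Spelltyp[_n+1] == "Employed")

* Assumption from the text: a subject first seen in employment was
* unemployed from time 0 until the start of that first spell
bysort ID (Begin): generate byte imputed = (_n == 1) & !atrisk & (Begin > 0)
replace End      = Begin if imputed
replace Begin    = 0     if imputed
replace Employed = 1     if imputed

* Drop the remaining employment spells; they become time gaps
keep if atrisk | imputed
keep ID Begin End Employed
list, sepby(ID) noobs

stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
```

Applied to the three subjects above, these rules keep the same five records shown in the combined table. Real data may need further rules, for example for spells that are not contiguous.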

Although our example data have only one failure per subject, real spell data will most likely contain multiple failures per subject, resulting from individuals moving in and out of employment. If this is the case, then you will probably benefit from reading the FAQ "Analysis of multiple failure-time data" or the article by the same name (Cleves 1999). Regardless of whether you have single- or multiple-failure-per-subject data, the logic used to create the survival dataset is as described. The only difference arises when you stset and then analyze the data.

Continuing with our example, we can now describe our data:

 . stdes
   
          failure _d:  Employed
    analysis time _t:  End
   exit on or before:  time .
                  id:  ID
 
                                    |-------------- per subject --------------|
 Category                   total        mean         min     median        max
 ------------------------------------------------------------------------------
 no. of subjects                3   
 no. of records                 5    1.666667           1          2          2
 
 (first) entry time                  3.333333           0          0         10
 (final) exit time                   23.66667           1         30         40
 
 subjects with gap              1   
 time on gap if gap            15          15          15         15         15
 time at risk                  46    15.33333           1         15         30
 
 failures                       3           1           1          1          1
 ------------------------------------------------------------------------------

stdescribe correctly reports that we have five observations for three subjects, that one subject has a gap lasting 15 time units (ID=102 from 20 to 35), and that there are three failures in the data (i.e., three transitions from unemployment to employment). Although the original dataset did not contain time gaps, the survival dataset does because of time spans during which the subjects are not at risk of the transition. This is not unusual when transforming spell data into survival-time data. Having verified our data, we are now ready to continue our data analysis using other st commands.

We can also set up a survival dataset corresponding to transitions from employment to unemployment by following a similar strategy.
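
As a rough illustration of that reverse setup, the sketch below keeps only the employment spells and marks a failure when the next spell is a non-employed state. It assumes the same spell data as above and treats a subject's last recorded spell as censored. The variable names atrisk and Unemployed are ours, not part of the original data.

```stata
clear
input ID str12 Spelltyp Begin End
101 "Employed"      1 72
102 "School-unemp" 10 20
102 "Employed"     20 35
102 "Unemployed"   35 40
103 "School-unemp"  0 20
103 "Welfare-unem" 20 30
103 "Employed"     30 60
end

* Only employment spells are at risk of the transition into unemployment
generate byte atrisk = (Spelltyp == "Employed")

* Failure: a later spell exists and it is a non-employed state
bysort ID (Begin): generate byte Unemployed = atrisk & (_n < _N) & (Spelltyp[_n+1] != "Employed")

* Non-employed spells are dropped and become time gaps
keep if atrisk
keep ID Begin End Unemployed
list, sepby(ID) noobs

stset End, failure(Unemployed) time0(Begin) id(ID) exit(time .)
```

Under these rules, only ID=102 fails (at time 35); ID=101 and ID=103 are still employed at the end of their last record and are therefore censored.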

stsetting the data

There are several ways to stset our data. The above dataset was stset in one of these possible ways. The proper stset syntax for the data, however, depends on the study design and assumptions. In what follows we provide guidance for selecting the appropriate stset command syntax. This is only a guide, and idiosyncrasies in your particular data may require more modifications or options.

There are two main questions that need to be answered to stset our data.

Question 1: When does the clock begin ticking?

If you want the “clock” to begin at time zero, then what we did above is correct. For calendar dates, t=0 corresponds to 1/1/1960 (the zero point of Stata dates); for the data above, t=0 is simply time 0 on the scale of Begin and End. The command we used was

   . stset End, failure(Employed) time0(Begin) id(ID) exit(time .)

If we want the “clock” to start ticking for each individual when the subject first enters unemployment, 10 for ID==102 and 0 for the others, then we need to specify origin().

   . stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)

Stata will use as the time origin the earliest entry time per subject.

When origin() is not specified, Stata automatically sets the origin to zero and treats records with entry times greater than zero as left-truncated or delayed-entry observations. That is what we obtained with our original syntax.
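
One way to see the effect of the two choices is to stset the data both ways and list the variables that stset creates (_t0, _t, and _d). This is a sketch only; it types in the five survival records from above and writes the origin in the keyword form origin(time Begin), which we assume to be equivalent to the origin(Begin) shown in the text.

```stata
clear
input ID Begin End Employed
101  0  1 1
102 10 20 1
102 35 40 0
103  0 20 0
103 20 30 1
end

* Clock starts at time 0 for everyone (the original setup)
stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
list ID Begin End _t0 _t _d, sepby(ID) noobs

* Clock starts at each subject's own entry time
stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(time Begin)
list ID Begin End _t0 _t _d, sepby(ID) noobs
```

If the origin is taken as the earliest entry time per subject, as described above, only ID=102 should change between the two listings, with its analysis times shifted down by 10.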

Question 2: How do we want to handle each subject’s second, third, etc., observations?

If we want the clock to continue ticking for each individual from the first observation forward, then we can use the syntax we used in our example

   . stset End, failure(Employed) time0(Begin) id(ID) exit(time .)

or, depending on the answer to question 1,

   . stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)

If, on the other hand, we want to reset the clock to zero or to the origin() for every observation, then we stset the data without specifying id(). The ID variable can be used later in the analysis to cluster the data and to produce robust standard errors.

   . stset End, failure(Employed) time0(Begin) exit(time .)

or, depending on the answer to question 1,

   . stset End, failure(Employed) time0(Begin) exit(time .) origin(Begin)
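
To illustrate the last variant together with clustering on ID, here is a small sketch using the five survival records from above. The constant-only exponential model is our choice for illustration only, because the toy data contain no covariates; any st estimation command that accepts vce(cluster) could take its place. The origin is again written in the keyword form origin(time Begin).

```stata
clear
input ID Begin End Employed
101  0  1 1
102 10 20 1
102 35 40 0
103  0 20 0
103 20 30 1
end

* No id(): each record is treated as its own subject, and
* origin(time Begin) restarts the clock at each record's entry time
stset End, failure(Employed) time0(Begin) exit(time .) origin(time Begin)
list ID Begin End _t0 _t _d, sepby(ID) noobs

* ID is still in the data and can be used to cluster the records
streg, distribution(exponential) vce(cluster ID)
```

With only three clusters, the robust standard errors here are not meaningful; the point is just where ID enters the analysis once it is no longer given in id().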

To summarize,

  1. If we want time to start at 0 and continue ticking for subsequent observations, we use
    . stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
    
  2. If we want time to start at 0 and to be reset to zero for every observation, then we do not specify id().
    . stset End, failure(Employed) time0(Begin) exit(time .)
    
  3. If we want time to start at the first entry time for each subject and continue ticking for subsequent observations, then we specify origin().
    . stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)
    
  4. If we want to reset the clock to begin at the entry time of each record, then we specify origin(), but not id().
    . stset End, failure(Employed) time0(Begin) exit(time .) origin(Begin)
    

Reference

Cleves, M. 1999. ssa13: Analysis of multiple failure-time data with Stata. Stata Technical Bulletin 49: 30–39.