# Re: st: Data management question

 From wgould@stata.com (William Gould) To statalist@hsphsun2.harvard.edu Subject Re: st: Data management question Date Mon, 22 Jul 2002 08:26:03 -0500

```
David Tucker <celdjt@umich.edu> writes,

> [...]  I am working with a file containing a mix of cases, some with single
> observations, others with multiple observations.  [...]  My plan is to model
> these data using survival analysis [...]  when I attempt to set up the data
> using snapspan, I end up with almost complete missing data (excepting for id
> value and failure indicator) for cases with single observations as well as
> for the first record in cases with multiple observations. Is there a way to
> correct this, or to avoid having it happen?

I suspect the problem is either (1) the dataset with which David is starting
is not really a snapshot dataset or (2) it is a snapshot dataset, but an
incomplete one that needs a little processing before using -snapspan-.

Let's start by understanding what snapshot and timespan datasets are:

Background
----------

A snapshot dataset is a dataset in which each observation represents
the status of a subject at a point in time:

Reality for subject_id = 1

      snapshot     snapshot      snapshot
         t=0          t=1           t=2
          |            |             |
    |-----+------------+-------------+----------->   time
          |            |             |
      status=0     status=1      status=2
        x = 1        x = 2         x = 3

The corresponding snapshot dataset:

    subject_id     time     status     x
             1        0          0     1
             1        1          1     2                     (1)
             1        2          2     3

To perform survival analysis, we need a time-span dataset, in which
each observation represents a span of time and the variables represent the
values over that span, except for any status variables, which
represent the status at the end of the span.

Corresponding time-span dataset:

    subject_id     time0    time1      x    status
             1        ??        0     ??         0
             1         0        1      1         1           (2)
             1         1        2      2         2
             1         2       ??      3        ??

It is important to understand the transformation from (1) to (2).  k
observations went to k+1 observations because, given k snapshots, there
are k+1 intervals:

      snapshot     snapshot      snapshot
         t=0          t=1           t=2
          |            |             |
    |-----+------------+-------------+----------->   time
    <-----|            |-------------|
      |   |------------|      |      |---------->
      |          |        interval 3       |
      |     interval 2                 interval 4
 interval 1

There were two kinds of variables in our original snapshot dataset:

    1.  Status variables that indicate something that happened or
        was true at that instant.  Examples:  admitted, discharged,
        died, became unemployed, found employment, ...

    2.  Measurement variables that record the value of a measurement
        that we are willing to assume remains constant until the
        next snapshot.  Examples:  sex, race, age, educational
        attainment, ...

In converting the snapshot dataset into a timespan dataset, status variables
and measurement variables are treated differently.  Status variables are
associated with the end of the interval.  Measurement variables are associated
with the beginning of the interval.  When I converted

    subject_id     time     status     x
             1        0          0     1
             1        1          1     2                     (1)
             1        2          2     3

into

    subject_id     time0    time1      x    status
             1        ??        0     ??         0
             1         0        1      1         1           (2)
             1         1        2      2         2
             1         2       ??      3        ??

I copied the x variable where time0==time, and I copied the status variable
where time1==time.

Now, as far as survival analysis is concerned, the first and last observations
of the resulting time-span dataset are useless, so let's throw them away:

    subject_id     time0    time1      x    status
             1         0        1      1         1           (3)
             1         1        2      2         2

That is what -snapspan- does, or at least what it ought to do because then
it would be easier to understand:  k observations become k+1 intervals of
which k-1 are kept.  In fact, -snapspan- keeps the first observation, so the
result will be

    subject_id     time0    time1      x    status
             1        ??        0     ??         0
             1         0        1      1         1           (4)
             1         1        2      2         2

-stset- later will know to ignore the first observation because it has no
information.

The syntax of the -snapspan- command is

. snapspan <idvar> <timevar> <other status vars>, generate(<tovar>)

I suggest anyone who plans on using -snapspan- enter the snapshot dataset
just illustrated

. clear
. input subject_id time status x
. 1 0 0 1
. 1 1 1 2
. 1 2 2 3
. end

and then type

. snapspan subject_id time status, replace gen(t0)

and -list- the result.

David's problem
---------------

David notes that subjects with a single snapshot observation turn into
useless observations in the result.  So they should, because there is
insufficient information for performing survival analysis if I observe
a person only once.

It may be, however, that David's dataset is not exactly a snapshot dataset.
What if the single observations mean the following:  "We observed the
person only twice, once at baseline and then again a year later."  The
documentation might continue, "For persons observed after that, we
record extra observations for when they were observed."

In that case, David has a strange combination of timespan and snapshot
datasets, to which the solution might be:

. by subject_id: expand 2 if _N==1
. replace time = . in <originalobs>/l
. sort subject_id time
. by subject_id: replace time = time+1 if time==1 & _n==2

-- Bill
wgould@stata.com
```
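
The snapshot-to-timespan rule described in the post (status variables attach to the end of each interval, measurement variables to its beginning) can be illustrated outside Stata. The following is a minimal Python sketch of that rule only; it is not how -snapspan- is implemented, and the function name `snapshots_to_spans`, the dictionary layout, and the second subject's values are assumptions made for illustration. It keeps the first record per subject with missing `time0` and `x`, as in table (4), and shows why a subject observed only once contributes no usable span.

```python
def snapshots_to_spans(rows):
    """Convert snapshot records into time-span records.

    rows: list of dicts with keys subject_id, time, status, x.
    Status is taken from the snapshot at the END of each span;
    x is taken from the snapshot at the BEGINNING of each span.
    """
    spans = []
    previous = {}  # most recent snapshot seen for each subject
    for row in sorted(rows, key=lambda r: (r["subject_id"], r["time"])):
        prev = previous.get(row["subject_id"])
        spans.append({
            "subject_id": row["subject_id"],
            # No earlier snapshot means the start of the span is unknown.
            "time0": prev["time"] if prev is not None else None,
            "time1": row["time"],
            "x": prev["x"] if prev is not None else None,
            "status": row["status"],
        })
        previous[row["subject_id"]] = row
    return spans


# Subject 1 is the example from the post; subject 2 is a hypothetical
# subject observed only once.
snapshots = [
    {"subject_id": 1, "time": 0, "status": 0, "x": 1},
    {"subject_id": 1, "time": 1, "status": 1, "x": 2},
    {"subject_id": 1, "time": 2, "status": 2, "x": 3},
    {"subject_id": 2, "time": 5, "status": 1, "x": 7},
]

spans = snapshots_to_spans(snapshots)

# Only spans with a known starting time carry information for survival analysis.
usable = [s for s in spans if s["time0"] is not None]

for s in spans:
    print(s)
```

In this sketch, subject 1 yields the three records of table (4), of which two are usable, while subject 2 yields a single record with missing `time0` and `x`. That matches the behaviour David reports for cases with single observations.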