Re: st: Data management question


From   wgould@stata.com (William Gould)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Data management question
Date   Mon, 22 Jul 2002 08:26:03 -0500

David Tucker <celdjt@umich.edu> writes, 

> [...]  I am working with a file containing a mix of cases, some with single
> observations, others with multiple observations.  [...]  My plan is to model
> these data using survival analysis [...]  when I attempt to set up the data
> using snapspan, I end up with almost complete missing data (excepting for id
> value and failure indicator) for cases with single observations as well as
> for the first record in cases with multiple observations. Is there a way to
> correct this, or to avoid having it happen?

I suspect the problem is either (1) the dataset with which David is starting 
is not really a snapshot dataset or (2) it is a snapshot dataset, but an 
incomplete one that needs a little processing before using -snapspan-.

Let's start by understanding what snapshot and time-span datasets are:


Background
----------

A snapshot dataset is a dataset in which each observation represents 
the status of a subject at a point in time:


        Reality for subject_id = 1

           snapshot     snapshot      snapshot
             t=0          t=1           t=2
              |            |             |
        |-----+------------+-------------+----------->   time
              |            |             |
           status=0      status=1    status=2 
            x = 1        x = 2         x = 3


        The corresponding snapshot dataset:

              subject_id     time     status     x
                       1        0          0     1
                       1        1          1     2                     (1)
                       1        2          2     3

To perform survival analysis, we need a time-span dataset in which 
each observation represents a span of time and the variables represent the 
values over that span, except for any status variables, which 
represent the status at the end of the span:

        Corresponding time-span dataset:

              subject_id     time0    time1      x    status
                       1        ??        0     ??         0
                       1         0        1      1         1           (2)
                       1         1        2      2         2
                       1         2       ??      3        ??

It is important to understand the transformation from (1) to (2).  k 
observations went to k+1 observations because, given k snapshots, there 
are k+1 intervals:

           snapshot     snapshot      snapshot
             t=0          t=1           t=2
              |            |             |
        |-----+------------+-------------+----------->   time
        <-----|            |-------------|
          |   |------------|      |      |----------> 
          |          |        interval 3       |
          |     interval 2                 interval 4
     interval 1

There were two kinds of variables in our original snapshot dataset:

        1.  Status variables that indicated something that happened or 
            was true at that instant.  Examples:  admitted, discharged, 
            died, became unemployed, found employment, ...

        2.  Measurement variables that indicate the value of a measurement 
            that we are willing to assume to remain constant until the 
            next snapshot.  Examples:  sex, race, age, educational 
            attainment, ...

In converting the snapshot dataset into a timespan dataset, status variables 
and measurement variables are treated differently.  Status variables are 
associated with the end of the interval.  Measurement variables are associated
with the beginning of the interval.  When I converted

              subject_id     time     status     x
                       1        0          0     1
                       1        1          1     2                     (1)
                       1        2          2     3
into 
              subject_id     time0    time1      x    status
                       1        ??        0     ??         0
                       1         0        1      1         1           (2)
                       1         1        2      2         2
                       1         2       ??      3        ??
              
I copied x from the snapshot where time equals time0, and I copied status
from the snapshot where time equals time1.

Now, as far as survival analysis is concerned, the first and last observations
of the resulting time-span dataset are useless, so let's throw them away:

              subject_id     time0    time1      x    status
                       1         0        1      1         1           (3)
                       1         1        2      2         2

That is what -snapspan- does, or at least what it ought to do because then 
it would be easier to understand:  k observations become k+1 intervals of
which k-1 are kept.  In fact, -snapspan- keeps the first observation, so the
result will be

              subject_id     time0    time1      x    status
                       1        ??        0     ??         0
                       1         0        1      1         1           (4)
                       1         1        2      2         2

-stset- later will know to ignore the first observation because it has no 
information.
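
To make the shift concrete, here is a minimal hand-rolled sketch that
reproduces layout (4) without -snapspan-.  It is only an illustration of the
idea, not a description of how -snapspan- is implemented, and the names time0
and x_span are hypothetical (here -time- plays the role of time1):

```stata
* Toy snapshot dataset (1) from the text.
clear
input subject_id time status x
1 0 0 1
1 1 1 2
1 2 2 3
end

* Each record now describes the span ending at -time-.  The span starts
* at the previous snapshot's time (missing for the first record).
bysort subject_id (time): generate time0 = time[_n-1]

* Measurement variables belong to the start of the span, so take them
* from the previous snapshot.  Status stays with the end of the span.
bysort subject_id (time): generate x_span = x[_n-1]

list subject_id time0 time x_span status, sepby(subject_id)
```

The first record ends up with missing time0 and x_span, which is the same
information-free record shown in (4).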

The syntax of the -snapspan- command is

        . snapspan <idvar> <timevar> <other status vars>, generate(<t0var>)

I suggest anyone who plans on using -snapspan- enter the snapshot dataset
just illustrated

        . clear
        . input subject_id time status x 
          1 0 0 1
          1 1 1 2
          1 2 2 3
          end

and then type 

        . snapspan subject_id time status, replace gen(t0)

and -list- the result.
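
As a follow-up, the converted data can be handed to -stset-.  The sketch
below repeats the toy example so that it runs on its own.  Treating
status==2 as the failure event is purely an assumption for illustration;
substitute whatever status value marks failure in the real data:

```stata
* Toy snapshot dataset from the text.
clear
input subject_id time status x
1 0 0 1
1 1 1 2
1 2 2 3
end

* Convert to time-span form; -time- becomes the end of each span and
* the new variable t0 the beginning.
snapspan subject_id time status, generate(t0) replace

* Declare the data to be survival-time data.
* failure(status==2) is an illustrative assumption, not part of the original.
stset time, id(subject_id) time0(t0) failure(status==2)

* Inspect the spans alongside the variables that -stset- creates.
list subject_id t0 time x status _st _t0 _t _d
```

Comparing _st across records shows which observations -stset- actually uses.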


David's problem
---------------

David notes that subjects with a single snapshot observation turn into 
useless observations in the result.  So they should, because there is 
insufficient information for performing survival analysis if I only 
observe a person once.
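
Before converting, it can help to see how many subjects have only one
snapshot.  A minimal sketch, using the toy dataset plus a made-up subject 2
with a single record; n_obs and tag are hypothetical variable names:

```stata
* Toy data: subject 1 from the text, plus an invented single-record subject 2.
clear
input subject_id time status x
1 0 0 1
1 1 1 2
1 2 2 3
2 0 0 5
end

* Number of snapshot records per subject.
bysort subject_id: generate n_obs = _N

* Flag one record per subject so that subjects, not records, are counted.
egen byte tag = tag(subject_id)
tabulate n_obs if tag

* The records that cannot form a usable span on their own.
list if n_obs == 1
```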

It may be, however, that David's dataset is not exactly a snapshot dataset.
What if the single observations mean the following: "We observed the
person only twice, once at baseline and then again a year later."  The
documentation might continue to read, "For persons observed after that, we
record extra observations for when they were observed."

In that case, David has a strange combination of time-span and snapshot
dataset, to which the solution might be:

         . bysort subject_id: generate nobs = _N
         . expand 2 if nobs==1, generate(copy)
         . replace time = time + 1 if copy==1
         . sort subject_id time

-- Bill
wgould@stata.com


