Re: st: grouped duration-discrete time survival analysis-WAS stset...

From: Enrica Croda
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: grouped duration-discrete time survival analysis-WAS stset...
Date: Mon, 2 Dec 2002 04:24:09 -0800 (PST)

```
On Mon, 2 Dec 2002, Stephen P. Jenkins wrote:

> On Sun, 1 Dec 2002 03:07:48 -0800 (PST) Enrica Croda
> <croda@nicco.sscnet.ucla.edu> wrote:
>
> <snip>
>
> > So, to recap, I now believe my data are grouped duration data...
> > I understand that in this case I need to organize my data in the so-called
> > "person-period" form.
> > I would appreciate getting feedback on the following:
> > My data are already organized by ID and year in "long" panel data
> > form (iis ID, tis year) with year = 1984, 1985,...1998.
> > A. Do I need to -expand- the data set?
> > I am thinking I just need to generate the analysis time
> > variable, with something like:
> > (A1)	by ID: generate TIME = _n;
> > B. How do I deal with delayed entry?
> > Assuming people first become at risk of not living independently at age 65,
> > which may not be the age at which they are first observed in my data,
> > how do I incorporate this information in my analysis?
>

> Suppose first that there is no delayed entry -- in which case you would
> need a row in the data set corresponding to each year that each person
> was /at risk of experiencing the event of interest/. If you were to
> assume the first year at risk corresponds to age 65, you need rows for
> each person for each year corresponding to age 65+. As the first survey
> year (1984 in GSOEP) is after age 65 for most persons, then you
> would need to create new rows in the data corresponding to those ages
> before the beginning of the survey. The TIME variable starts with 1 for
> age 65, then 2 for age 66, and so on. [You would also need to 'spread'
> values for explanatory variables back onto these new person-year obs.]
> -expand- could probably be used to create the required data structure,
> making use of the -if- qualifier to ensure that the correct number of
> new person-year observations gets generated for each person. (As the
> respondents were of different ages in 1984, the number of new data rows
> will differ from person to person.)
>

Ideally, I would like to use some time-varying variables (e.g. income)
in the analysis. What would be the appropriate thing to do for these
variables when I 'spread' them?

> Now, to control for the delayed entry aspect and get the likelihood
> correct, all you need do is create the data structure as just stated,
> but throw away the person-years corresponding to pre-1984 (first survey
> year). (Note that the duration counter TIME does not start from 1 in
> most cases in the delayed-entry version of the data set.)

I am afraid I am still missing something. Please forgive me if this is a
silly question. If I understand correctly, the only variable I really
need is the appropriate 'analysis time' counter. I will throw away all the
records generated through -expand-. Correct?

If this is correct, could I accomplish the same goal by not expanding at
all, and using NEWTIME rather than TIME as 'analysis time', where NEWTIME
is generated as follows:

by ID: generate newtime= _n + (age[1] - 66);
label variable newtime "analysis time";

by ID: generate agediff= (age[1] - 65) if year==84;
label variable agediff "age-65 in 1984";

by ID: generate ageflag= agediff[1] if (agediff[1]~=.);
label variable ageflag "auxiliary var";

by ID: replace newtime=_n if ageflag==.;

Here is a listing of what I get with this code:

ID       year       age    newtime
201         91        65          1
201         92        66          2
201         93        67          3
201         94        68          4
201         95        69          5
201         96        70          6
201         97        71          7
201         98        72          8

1101         84        78         13
1101         85        79         14
1101         86        80         15
1101         87        81         16
1101         88        82         17
1101         89        83         18
1101         90        84         19
1101         91        85         20
1101         92        86         21
1101         93        87         22
1101         94        88         23
1101         95        89         24
1101         96        90         25
1101         97        91         26
1101         98        92         27

20302         87        65          1
20302         88        66          2
20302         89        67          3
20302         90        68          4
20302         91        69          5
20302         94        72          6
20302         95        73          7
20302         96        74          8
20302         97        75          9
20302         98        76         10

> All this is
> discussed in those lecture notes you cited, together with regression
> models that you could apply once the data have been created.
>

Thanks! Your lecture notes are indeed extremely helpful (I also got your
1995 article in the Oxford Bulletin of Economics and Statistics), and I
think I understand what to do for the estimation part of the project.
It is the preparation of the data set for the analysis that I still find
complicated. (This is my first time doing duration analysis.)

> > C. Would the solution to question B be different if I plan to control for
> > age in the 'regression' analysis?
>
> Given the way you have defined your time-at-risk variable (in terms of
> age), wouldn't "age" as an explanatory variable be perfectly correlated
> with TIME?
>

Yes, it would! Thanks for pointing it out!

<snip>

Thank you very much for all your help!

Enrica

```
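
A minimal sketch of the analysis-time bookkeeping discussed in the thread, written in Stata. It assumes the convention quoted above (TIME = 1 at age 65, 2 at age 66, and so on) and uses a small made-up data set patterned on the listing in the post. The variable names `ID`, `year`, `age` and `newtime` follow the post; the four-digit year values and the data rows themselves are illustrative only. Computing the counter from `age` rather than from `_n` is a choice made for this sketch, not something the thread prescribes: under the stated convention it needs no -expand- step, it does not depend on the year in which a person is first observed, and it stays aligned with age when a person has a gap in the panel (as ID 20302 does between 1991 and 1994).

```stata
* Hypothetical toy data patterned on the listing in the post.
clear
input long ID int year byte age
201   1991 65
201   1992 66
1101  1984 78
1101  1985 79
20302 1991 69
20302 1994 72
end

* Analysis time under the assumed convention TIME = 1 at age 65.
* A person first observed at age 78 therefore enters at newtime = 14.
generate int newtime = age - 64
label variable newtime "analysis time (1 = age 65)"

* Within person, the counter should advance by exactly as many years
* as the calendar does, including across gaps in the panel.
bysort ID (year): assert newtime - newtime[_n-1] == year - year[_n-1] if _n > 1

* age and newtime differ only by a constant, so they are perfectly
* correlated, which is the point made about using age as a regressor.
correlate age newtime

list ID year age newtime, sepby(ID) noobs
```

With delayed entry handled this way, each person's first row simply carries a value of `newtime` greater than 1 when they are first observed after age 65, which matches the remark in the thread that the duration counter does not start from 1 in most cases.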