Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: -disjoint- available on SSC (discursive tutorial)


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: -disjoint- available on SSC (discursive tutorial)
Date   Thu, 13 May 2004 21:19:04 +0100

Clint Thompson posted a problem on Tuesday: for a set of people,
he wanted to add up their service in 2000 and 2001, from data of
the form 

	id  start_date end_date 

Here's a report, with some editing out of red herrings, blind
alleys and what not. It's done for two main reasons: it turns out, 
I think, to be an instructive example yielding to some basic Stata 
principles; some one might be able to improve on the treatment or 
point out a problem I've missed. 

This is quite long; it is offered as a tutorial for those
interested in learning some Stata ideas. So persevere, bail out
now, or save this for some later time when the mood strikes. 
(And thanks to Clint for an intriguing problem.) 

Let's say that each such record defines a spell. Doing this
separately for each person is an application of -by:- in some
form. Whether that is done directly or indirectly (e.g. through
-egen, by()-) is a matter of taste and convenience. 

Otherwise the problem, in its simplest form for spells that are
disjoint (i.e. not overlapping), consists of adding up
contributions like        

	1 + 
	min(end_date, mdy(12,31,2001)) - 
	max(mdy(1,1,2000),start_date) 

In fact, a complete solution is 

	gen cont = 1 + min(end_date, mdy(12,31,2001)) 
	        - max(mdy(1,1,2000),start_date) 
	egen sumservice = sum(cont), by(enrolid) 

which could even be telescoped into one line if you were so
inclined.

A first Stata principle here is know your functions, built-in and
-egen-. They can do a lot of work for you. 

As an aside, the 1 here is added on the presumption that anyone
starting and ending on the same date is regarded as serving one
day. In other circumstances, this correction might not be needed:
if you were counting overnight stays in a hotel, someone who
arrived and left on the same day did not stay overnight.

However, the twist that gave this problem its spin turned out to
be that these spells were _not_ necessarily disjoint. For what
Clint wanted, such double counting was not acceptable. (In other 
problems it might be.) 

-findit- pointed to a solution: Edwin Leuven had worked out a
program -spellsplit- which splits overlapping spells into disjoint
segments. (A key Stata principle here is naturally to use
-findit-.) The matter could end here, as Edwin's code works fine
and solved the problem to Clint's satisfaction. His code works by
adding extra observations, and it has extra handles too for
solving other problems. But I was intrigued by the problem and
tried to finish what I had started for myself, partly because 
I wanted also a conservative solution which didn't change the
number of observations.  

What can be tricky I think even for Stata programmers is to know
at what point you should decide to write a proper -program-.  The
circumstances can include: 

* You are fed up with typing in the same commands again and again. 

* You notice a do file is being used so much that a bit of work
making it more general will save you effort, in the longer run. 
 
* Complications such as handling -if- and -in-, the possibility of
missing values, etc., can arise, and you don't trust yourself to
get it right, on the fly, every time; or even if you do, the
details are too tiresome to work through repeatedly. 

In this particular case, I guess the last was nearest the mark. 

What follows is, I guess, mostly but not exclusively of interest
to those who program in Stata. 

Two simple but perhaps not totally obvious tips are to work with 

* a simple test dataset for which you can work out the answers 

* a do file which reads in the data and runs the program, etc. 

In the program itself, some things come with experience: 

* Something like this will depend on getting data in the right
-sort- order. The program itself should be made -sortpreserve- so
that the dataset is not changed in that sense. 

* Use -marksample-'s machinery to identify observations we will 
use (and ensure a swift -exit- if there aren't any). By convention
we work with a binary variable `touse' which is 1 for observations
we're using and 0 otherwise. 

* Build in some checks, such that start dates are not greater than
end dates and that start and end dates are both integers. (The
last assumption is made explicit to save the programmer the work
of trying to be totally general.) 

* Allow an identifier variable but don't insist on one (here
`id'). 

* At the back of our mind is a belief that we shouldn't need 
to loop over observations. Getting the right -sort- order
and some minor trickery with -by:-, _n and _N will give us
cleaner and much faster code. (This is one of the most Stataish
things to get accustomed to.) 

The -sort- order is conveniently first on `touse' `id'. There'll 
always be a `touse'. If there's no `id', the macro will be
undefined and evaluate to an empty string, not a problem here. 
(In other circumstances, we often do something like 

	by `touse' `by': 

So long as there's always a `touse', Stata won't complain if `by' 
is empty.) 

I wasted some time following the guess that within such blocks the
natural order was 

       `start' `end' 

which eventually I decided was not quite the best thing, as I'll
explain shortly. 

Throughout all this, the most important scribbles were not of code
in my favourite text editor, but on bits of paper on which I
visualised spells of various kinds: 

Let's assume that time flows from left to right, so that for an
individual three distinct spells are 

       **** 
              ***
                     ** 

and, for another, four (overlapping) spells are 

       *****
         ***************
         ****************** 
                   ********* 

These are as you can see sorted on `start' `end': the spell with
the earliest `start' is listed first and the ties on `start' are
broken according to `end'. Two ideas emerged from such scribbles,
one positive and one negative. 

The positive idea was that an algorithm could be based on the
maximum date so far seen (i.e. the latest end date yet seen).  

The negative idea was that the algorithm would have to able to
cope with enclosed spells. To borrow a phrase from the scientist
Richard Levins, the programmer should not make "decency
assumptions" about the data and assume that pathologies, meaning
something awkward for the programmer, don't arise. Enclosed spells
might have the same `start' but different `ends'

       ****
       *****
       *******

or the same `end' but different `starts' 

              ****
        **********

or they could be snuggled up inside longer spells: 

       *******************
              **********
                     ************ 

Cases like that above turn out to be the North Face on this little
Eiger. The middle spell adds no information: we already know (in
Clint's terms; no, not Clint Eastwood, Clint Thompson) that the
person was in service in that time, but we have got to get the
program to ignore such an uninformative spell. 

More messing around than I really want to confess in public led
to the conclusion that the best -sort- order was on `start' and
-`end' (i.e. negated ends), so that we have spells in orders 
like this: 

       *******
       ****
       *** 
        ****
        *** 
           ***************** 
                         **************** 
                                          **** 

Then we keep a track of the maximum so far (here denoted M): 

       ******M
       ****  M 
       ***   M 
        **** M 
        ***  M 
           ****************M 
                         ***************M 
                                          ***M 

Now we have a criterion for identifying what should be ignored: if
(and only if) we have a new maximum, we thereby have a spell which
adds information to what we already know. Also, we see the point
of the sort order `start' -`end': that way, we see longer spells
first, for a given `start'. 

Turning to the code, and adding a few Stata comments, and some 
commentary on top of that: 

       tempvar negend max 
              
       // sort on start then on -end; then find maximum so far 
       gen `negend' = -`end'
       gen `max' = . 
       bysort `touse' `id' (`start' `negend') : ///
              replace `max' = max(`end', `max'[_n-1])        

In using -max(,)- there are two principles which deserve emphasis. 

One which may surprise is that -max(.,<non-missing>)- yields the
non-missing value as a result, even though you've learned since
early Stata days that missing is treated as arbitrarily large.
This is, the more you think about it, what you (usually) really
expect. You wouldn't want to be told that

       max(-1, 0, 2.71828, 3.14159, 42,.)

is . (missing): just as with a simple look at data you'd ignore the 
missings and report 42. 

Another principle mentioned in some places (not enough) is that 
you can rely on -replace- following the -sort- order of the 
data. So that within blocks of observations 

       gen `max' = . 
       replace `max' = max(`end', `max'[_n-1])        

is code for running maximum. Stata doesn't allow here 

       gen `max' = max(`end', `max'[_n-1])

as the RHS has got to be defined before the LHS can be. 

So what we then do is mark out of consideration all those 
enclosed spells 

       // don't use spell if end not a new maximum 
       // (i.e. spell enclosed within another spell) 
       by `touse' `id' : replace `touse' = 0 
                            if `max' == `max'[_n-1]   

Our first approximation to the ends of disjoint spells is then                             
       
       // first stab, without correction for overlap 
       gen `generate' = `end' if `touse'   

We changed `touse', so we should -sort- the observations to be ignored 
out of the way. 

       *******
       ****
       *** 
        ****
        *** 
           ***************** 
                         **************** 
                                          **** 

is thereby reduced to 

       *******
           ***************** 
                         **************** 
                                          **** 

Now we just look down from spell to spell. If the next spell
starts before this one ends, we chop off the overlap. Our new END
variable (denoted E) will be 

       ***E     
           *************E    
                         ***************E 
                                          ***E 

Now our spells are disjoint. Here's the code. 

        // if each used spell overlaps with next, 
        // chop off the overlap 
        bysort `touse' `id' (`start') : ///
                replace `generate' = `start'[_n+1] - 1 ///
                if `end' >= `start'[_n+1] & `touse'

We should worry a little about boundary cases here (and 
elsewhere). For example, for the last observation in each 
block, `start'[_N+1] will be treated as missing, but `end'[_N] 
can never be missing, as any missing `end's will have been 
marked not `touse' by the -marksample- statement not shown 
here. More generally, things happening not as you want at
the beginning and end of blocks (_n == 1, _n == _N) are 
a source of bugs. 

Thanks to Kit Baum, the code is posted as -disjoint- on SSC. 		

Nick 
n.j.cox@durham.ac.uk 

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index