[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
"Nick Cox" <n.j.cox@durham.ac.uk> |

To |
<statalist@hsphsun2.harvard.edu> |

Subject |
st: -disjoint- available on SSC (discursive tutorial) |

Date |
Thu, 13 May 2004 21:19:04 +0100 |

Clint Thompson posted a problem on Tuesday: for a set of people, he wanted to add up their service in 2000 and 2001, from data of the form id start_date end_date Here's a report, with some editing out of red herrings, blind alleys and what not. It's done for two main reasons: it turns out, I think, to be an instructive example yielding to some basic Stata principles; some one might be able to improve on the treatment or point out a problem I've missed. This is quite long; it is offered as a tutorial for those interested in learning some Stata ideas. So persevere, bail out now, or save this for some later time when the mood strikes. (And thanks to Clint for an intriguing problem.) Let's say that each such record defines a spell. Doing this separately for each person is an application of -by:- in some form. Whether that is done directly or indirectly (e.g. through -egen, by()-) is a matter of taste and convenience. Otherwise the problem, in its simplest form for spells that are disjoint (i.e. not overlapping), consists of adding up contributions like 1 + min(end_date, mdy(12,31,2001)) - max(mdy(1,1,2000),start_date) In fact, a complete solution is gen cont = 1 + min(end_date, mdy(12,31,2001)) - max(mdy(1,1,2000),start_date) egen sumservice = sum(cont), by(enrolid) which could even be telescoped into one line if you were so inclined. A first Stata principle here is know your functions, built-in and -egen-. They can do a lot of work for you. As an aside, the 1 here is added on the presumption that anyone starting and ending on the same date is regarded as serving one day. In other circumstances, this correction might not be needed: if you were counting overnight stays in a hotel, someone who arrived and left on the same day did not stay overnight. However, the twist that gave this problem its spin turned out to be that these spells were _not_ necessarily disjoint. For what Clint wanted, such double counting was not acceptable. (In other problems it might be.) -findit- pointed to a solution: Edwin Leuven had worked out a program -spellsplit- which splits overlapping spells into disjoint segments. (A key Stata principle here is naturally to use -findit-.) The matter could end here, as Edwin's code works fine and solved the problem to Clint's satisfaction. His code works by adding extra observations, and it has extra handles too for solving other problems. But I was intrigued by the problem and tried to finish what I had started for myself, partly because I wanted also a conservative solution which didn't change the number of observations. What can be tricky I think even for Stata programmers is to know at what point you should decide to write a proper -program-. The circumstances can include: * You are fed up with typing in the same commands again and again. * You notice a do file is being used so much that a bit of work making it more general will save you effort, in the longer run. * Complications such as handling -if- and -in-, the possibility of missing values, etc., can arise, and you don't trust yourself to get it right, on the fly, every time; or even if you do, the details are too tiresome to work through repeatedly. In this particular case, I guess the last was nearest the mark. What follows is, I guess, mostly but not exclusively of interest to those who program in Stata. Two simple but perhaps not totally obvious tips are to work with * a simple test dataset for which you can work out the answers * a do file which reads in the data and runs the program, etc. In the program itself, some things come with experience: * Something like this will depend on getting data in the right -sort- order. The program itself should be made -sortpreserve- so that the dataset is not changed in that sense. * Use -marksample-'s machinery to identify observations we will use (and ensure a swift -exit- if there aren't any). By convention we work with a binary variable `touse' which is 1 for observations we're using and 0 otherwise. * Build in some checks, such that start dates are not greater than end dates and that start and end dates are both integers. (The last assumption is made explicit to save the programmer the work of trying to be totally general.) * Allow an identifier variable but don't insist on one (here `id'). * At the back of our mind is a belief that we shouldn't need to loop over observations. Getting the right -sort- order and some minor trickery with -by:-, _n and _N will give us cleaner and much faster code. (This is one of the most Stataish things to get accustomed to.) The -sort- order is conveniently first on `touse' `id'. There'll always be a `touse'. If there's no `id', the macro will be undefined and evaluate to an empty string, not a problem here. (In other circumstances, we often do something like by `touse' `by': So long as there's always a `touse', Stata won't complain if `by' is empty.) I wasted some time following the guess that within such blocks the natural order was `start' `end' which eventually I decided was not quite the best thing, as I'll explain shortly. Throughout all this, the most important scribbles were not of code in my favourite text editor, but on bits of paper on which I visualised spells of various kinds: Let's assume that time flows from left to right, so that for an individual three distinct spells are **** *** ** and, for another, four (overlapping) spells are ***** *************** ****************** ********* These are as you can see sorted on `start' `end': the spell with the earliest `start' is listed first and the ties on `start' are broken according to `end'. Two ideas emerged from such scribbles, one positive and one negative. The positive idea was that an algorithm could be based on the maximum date so far seen (i.e. the latest end date yet seen). The negative idea was that the algorithm would have to able to cope with enclosed spells. To borrow a phrase from the scientist Richard Levins, the programmer should not make "decency assumptions" about the data and assume that pathologies, meaning something awkward for the programmer, don't arise. Enclosed spells might have the same `start' but different `ends' **** ***** ******* or the same `end' but different `starts' **** ********** or they could be snuggled up inside longer spells: ******************* ********** ************ Cases like that above turn out to be the North Face on this little Eiger. The middle spell adds no information: we already know (in Clint's terms; no, not Clint Eastwood, Clint Thompson) that the person was in service in that time, but we have got to get the program to ignore such an uninformative spell. More messing around than I really want to confess in public led to the conclusion that the best -sort- order was on `start' and -`end' (i.e. negated ends), so that we have spells in orders like this: ******* **** *** **** *** ***************** **************** **** Then we keep a track of the maximum so far (here denoted M): ******M **** M *** M **** M *** M ****************M ***************M ***M Now we have a criterion for identifying what should be ignored: if (and only if) we have a new maximum, we thereby have a spell which adds information to what we already know. Also, we see the point of the sort order `start' -`end': that way, we see longer spells first, for a given `start'. Turning to the code, and adding a few Stata comments, and some commentary on top of that: tempvar negend max // sort on start then on -end; then find maximum so far gen `negend' = -`end' gen `max' = . bysort `touse' `id' (`start' `negend') : /// replace `max' = max(`end', `max'[_n-1]) In using -max(,)- there are two principles which deserve emphasis. One which may surprise is that -max(.,<non-missing>)- yields the non-missing value as a result, even though you've learned since early Stata days that missing is treated as arbitrarily large. This is, the more you think about it, what you (usually) really expect. You wouldn't want to be told that max(-1, 0, 2.71828, 3.14159, 42,.) is . (missing): just as with a simple look at data you'd ignore the missings and report 42. Another principle mentioned in some places (not enough) is that you can rely on -replace- following the -sort- order of the data. So that within blocks of observations gen `max' = . replace `max' = max(`end', `max'[_n-1]) is code for running maximum. Stata doesn't allow here gen `max' = max(`end', `max'[_n-1]) as the RHS has got to be defined before the LHS can be. So what we then do is mark out of consideration all those enclosed spells // don't use spell if end not a new maximum // (i.e. spell enclosed within another spell) by `touse' `id' : replace `touse' = 0 if `max' == `max'[_n-1] Our first approximation to the ends of disjoint spells is then // first stab, without correction for overlap gen `generate' = `end' if `touse' We changed `touse', so we should -sort- the observations to be ignored out of the way. ******* **** *** **** *** ***************** **************** **** is thereby reduced to ******* ***************** **************** **** Now we just look down from spell to spell. If the next spell starts before this one ends, we chop off the overlap. Our new END variable (denoted E) will be ***E *************E ***************E ***E Now our spells are disjoint. Here's the code. // if each used spell overlaps with next, // chop off the overlap bysort `touse' `id' (`start') : /// replace `generate' = `start'[_n+1] - 1 /// if `end' >= `start'[_n+1] & `touse' We should worry a little about boundary cases here (and elsewhere). For example, for the last observation in each block, `start'[_N+1] will be treated as missing, but `end'[_N] can never be missing, as any missing `end's will have been marked not `touse' by the -marksample- statement not shown here. More generally, things happening not as you want at the beginning and end of blocks (_n == 1, _n == _N) are a source of bugs. Thanks to Kit Baum, the code is posted as -disjoint- on SSC. Nick n.j.cox@durham.ac.uk * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: multiple imputation and transformations** - Next by Date:
**st: foreach question** - Previous by thread:
**st: multiple imputation and transformations** - Next by thread:
**st: foreach question** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |