Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Nick Cox <njcoxstata@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: bysort, forvalues and _N |

Date |
Sat, 26 Jan 2013 00:39:20 +0000 |

This is something I have in draft form: Often dates for transactions or events come in pairs, say start and finish or open and close. People arrive for an appointment at a clinic and later leave; people order goods, which are later delivered; people start work for an employer, and later leave. In all cases, when we take a snapshot, we may find people still in the clinic, goods not yet delivered, or (it is to be hoped) people still working for the employer. With such events, it is natural that each pair of events is often recorded as an observation (case, row, or record) in a dataset. Such a structure makes some calculations easy. For exanple, the differences between arrivals and departures or orders and deliveries are key to system performance, although the ideal will be short delays for seeing patients or selling goods and long delays for periods of employment or lifetimes. In Stata with two variables for (say) -arrival- and -depart-, the delay is just . gen lapse = depart - arrival Precisely how we record time will depend on the problem, but here I am imagining a date or date-time or time variable. If the closing events have yet to happen, then -depart- may need to be recorded as missing. If so, -lapse- will in turn be missing. Notice, by the way, a simple data quality check that the time lapse can never be negative. Time lapses recorded as zeros might also need checking in some situations; was the system really that fast and efficient (or ruthless)? Such a simple data structure -- one observation for each transaction -- may also be awkward, which leads to the main reason for this tip. Experience with data structures in Stata might lead readers to suggest a -reshape long-, which could be a good idea, but there is an easier alternative, to use -expand-. We need first a unique or distinct identifier for each transaction, which may already exist. The command -isid- allows a simple check of whether an identifier variable indeed matches its purpose. If the identifier variable is broken or non-existent, then something like . gen long id = _n creates a new identifier fit for purpose. Specifying a -long- variable type allows over 2 billion distinct positive identifiers, should they be needed. Otherwise we use the existing identifier. Then . expand 2 is the abracadabra needed. We turn each observation into two. The new observations are added at the end of the dataset, so we need to sort them before we can create two new variables that are the keys to other calculations. . bysort id: gen time = cond(_n == 1, arrival, depart) Each distinct value of the identifier now occurs precisely twice. We can therefore use the framework provided by -by:-. See Cox (2002) for a tutorial if desired. Under -by:-, the observation number _n is interpreted within groups (here all pairs) and we assign -arrival- to the first observation of two and -depart- to the second. Let's imagine a small section of a toy dataset and apply our expansion method. . list +-----------------------+ | id arrival depart | |-----------------------| 1. | 1 1000 1100 | 2. | 2 1100 1300 | 3. | 3 1200 1400 | +-----------------------+ . expand 2 (3 observations created) . bysort id : gen time = cond(_n == 1, arrival, depart) . by id : gen inout = cond(_n == 1, 1, -1) . sort time . list, sep(0) +--------------------------------------+ | id arrival depart inout time | |--------------------------------------| 1. | 1 1000 1100 1 1000 | 2. | 1 1000 1100 -1 1100 | 3. | 2 1100 1300 1 1100 | 4. | 3 1200 1400 1 1200 | 5. | 2 1100 1300 -1 1300 | 6. | 3 1200 1400 -1 1400 | +--------------------------------------+ The flag variable -inout- records additions and subtractions, so that the cumulative or running sum keeps track of the number inside the system. In a jargon common in economics, flows are used to calculate stocks, it being understood that any stock from before the start of records would need to be added. . gen present = sum(inout) . list, sep(0) +------------------------------------------------+ | id arrival depart time inout present | |------------------------------------------------| 1. | 1 1000 1100 1000 1 1 | 2. | 1 1000 1100 1100 -1 0 | 3. | 2 1100 1300 1100 1 1 | 4. | 3 1200 1400 1200 1 2 | 5. | 2 1100 1300 1300 -1 1 | 6. | 3 1200 1400 1400 -1 0 | +------------------------------------------------+ This is only one trick, and others will depend on your problem. For example, if a clinic is only open daily, the number present should drop to zero at the end of each day. More generally, stocks can not be negative. The logic of how your system operates provides a logic for your code and checks on data quality. For a different problem arising with paired data, see Cox (2008). Cox, N. J. 2002. Speaking Stata: How to move step by: step. Stata Journal 2: 86--102. Cox, N. J. 2008. Tip 71: The problem of split identity, or how to group dyads. Stata Journal 8: 588--591. On Fri, Jan 25, 2013 at 11:26 PM, Constantin Alba <constantin.alba@gmail.com> wrote: > thanks Nick, > > that is what i was looking for exactly. and the data actually was like > that (start/end date) originally. so i just used the trick on it. the > results are correct. > > I did understand how this trick worked in the example after the link, > but in current example i am missing it. can you, please, explain the > use of "bysort obsno" here? > > thanks > > C. > > On Fri, Jan 25, 2013 at 10:10 PM, Nick Cox <njcoxstata@gmail.com> wrote: >> I wouldn't approach it like that. I don't think a loop is required, >> but rather a different data structure. Also, holding Duration may be >> interesting and useful, but it is just awkward for this problem. >> >> First save the dataset if not already -save-d. Then >> >> gen long obsno = _n >> gen endDate = startDate + Duration >> expand 2 >> bysort obsno : gen Date = cond(_n == 1, startDate, endDate) >> bysort obsno : gen inOut = cond(_n == 1, 1, -1) >> bysort PersonID (Date) : gen noActivities = sum(inOut) >> >> Compare e.g. thread starting at >> >> http://www.stata.com/statalist/archive/2012-11/msg01163.html >> >> Nick >> >> On Fri, Jan 25, 2013 at 7:38 PM, Constantin Alba >> <constantin.alba@gmail.com> wrote: >> >>> I have a dataset organized by personID, for each personID there are >>> startDate and Duration for various activities. >>> >>> I want to find out how many activities a person performs at the same >>> time (e.g. something like workload) >>> >>> logically i see it like this: >>> >>> sort personID startDate Duration >>> >>> by personID: forvalues i=1/`=_N' { >>> "compare current date with all >>> other prior dates+duration >>> and update counter accordingly" >>> } >>> >>> however, by and forvalues do not work together, so i am trying to find >>> an alternative way to do so. * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: bysort, forvalues and _N***From:*KOTa <kota.alba@gmail.com>

**References**:**st: bysort, forvalues and _N***From:*Constantin Alba <constantin.alba@gmail.com>

**Re: st: bysort, forvalues and _N***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: bysort, forvalues and _N***From:*Constantin Alba <constantin.alba@gmail.com>

- Prev by Date:
**Re: st: bysort, forvalues and _N** - Next by Date:
**st: filling missing values** - Previous by thread:
**Re: st: bysort, forvalues and _N** - Next by thread:
**Re: st: bysort, forvalues and _N** - Index(es):