Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
KOTa <kota.alba@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: bysort, forvalues and _N |

Date |
Sat, 26 Jan 2013 11:52:05 +0200 |

thank you C. On Sat, Jan 26, 2013 at 2:39 AM, Nick Cox <njcoxstata@gmail.com> wrote: > This is something I have in draft form: > > Often dates for transactions or events come in pairs, say start and > finish or open and close. People arrive for an appointment at a clinic > and later leave; people order goods, which are later delivered; people > start work for an employer, and later leave. In all cases, when we take > a snapshot, we may find people still in the clinic, goods not yet > delivered, or (it is to be hoped) people still working for the employer. > > With such events, it is natural that each pair of events is often > recorded as an observation (case, row, or record) in a dataset. Such a > structure makes some calculations easy. For exanple, the differences > between arrivals and departures or orders and deliveries are key to > system performance, although the ideal will be short delays for seeing > patients or selling goods and long delays for periods of employment or > lifetimes. In Stata with two variables for (say) -arrival- and -depart-, > the delay is just > > . gen lapse = depart - arrival > > Precisely how we record time will depend on the problem, but here I am > imagining a date or date-time or time variable. > > If the closing events have yet to happen, then -depart- may need to > be recorded as missing. If so, -lapse- will in turn be missing. > Notice, by the way, a simple data quality check that the time lapse can > never be negative. Time lapses recorded as zeros might also need > checking in some situations; was the system really that fast and > efficient (or ruthless)? > > Such a simple data structure -- one observation for each transaction -- > may also be awkward, which leads to the main reason for this tip. > Experience with data structures in Stata might lead readers to suggest a > -reshape long-, which could be a good idea, but there is an easier > alternative, to use -expand-. > > We need first a unique or distinct identifier for each transaction, > which may already exist. The command -isid- allows a simple check of > whether an identifier variable indeed matches its purpose. If the > identifier variable is broken or non-existent, then something like > > . gen long id = _n > > creates a new identifier fit for purpose. Specifying a -long- > variable type allows over 2 billion distinct positive identifiers, > should they be needed. Otherwise we use the existing identifier. Then > > . expand 2 > > is the abracadabra needed. We turn each observation into two. The new > observations are added at the end of the dataset, so we need to sort > them before we can create two new variables that are the keys to other > calculations. > > . bysort id: gen time = cond(_n == 1, arrival, depart) > > Each distinct value of the identifier now occurs precisely twice. We can > therefore use the framework provided by -by:-. See Cox (2002) for a > tutorial if desired. Under -by:-, the observation number _n is > interpreted within groups (here all pairs) and we assign -arrival- > to the first observation of two and -depart- to the second. > > Let's imagine a small section of a toy dataset and apply our expansion > method. > > . list > > +-----------------------+ > | id arrival depart | > |-----------------------| > 1. | 1 1000 1100 | > 2. | 2 1100 1300 | > 3. | 3 1200 1400 | > +-----------------------+ > > . expand 2 > (3 observations created) > > . bysort id : gen time = cond(_n == 1, arrival, depart) > > . by id : gen inout = cond(_n == 1, 1, -1) > > . sort time > > . list, sep(0) > > +--------------------------------------+ > | id arrival depart inout time | > |--------------------------------------| > 1. | 1 1000 1100 1 1000 | > 2. | 1 1000 1100 -1 1100 | > 3. | 2 1100 1300 1 1100 | > 4. | 3 1200 1400 1 1200 | > 5. | 2 1100 1300 -1 1300 | > 6. | 3 1200 1400 -1 1400 | > +--------------------------------------+ > > The flag variable -inout- records additions and subtractions, so > that the cumulative or running sum keeps track of the number inside the > system. In a jargon common in economics, flows are used to calculate > stocks, it being understood that any stock from before the start of > records would need to be added. > > . gen present = sum(inout) > > . list, sep(0) > > +------------------------------------------------+ > | id arrival depart time inout present | > |------------------------------------------------| > 1. | 1 1000 1100 1000 1 1 | > 2. | 1 1000 1100 1100 -1 0 | > 3. | 2 1100 1300 1100 1 1 | > 4. | 3 1200 1400 1200 1 2 | > 5. | 2 1100 1300 1300 -1 1 | > 6. | 3 1200 1400 1400 -1 0 | > +------------------------------------------------+ > > This is only one trick, and others will depend on your problem. For > example, if a clinic is only open daily, the number present should drop > to zero at the end of each day. More generally, stocks can not be > negative. The logic of how your system operates provides a logic for your > code and checks on data quality. > > For a different problem arising with paired data, see Cox (2008). > > Cox, N. J. 2002. > Speaking Stata: How to move step by: step. > Stata Journal 2: 86--102. > > Cox, N. J. 2008. > Tip 71: The problem of split identity, or how to group dyads. > Stata Journal 8: 588--591. > > On Fri, Jan 25, 2013 at 11:26 PM, Constantin Alba > <constantin.alba@gmail.com> wrote: >> thanks Nick, >> >> that is what i was looking for exactly. and the data actually was like >> that (start/end date) originally. so i just used the trick on it. the >> results are correct. >> >> I did understand how this trick worked in the example after the link, >> but in current example i am missing it. can you, please, explain the >> use of "bysort obsno" here? >> >> thanks >> >> C. >> >> On Fri, Jan 25, 2013 at 10:10 PM, Nick Cox <njcoxstata@gmail.com> wrote: >>> I wouldn't approach it like that. I don't think a loop is required, >>> but rather a different data structure. Also, holding Duration may be >>> interesting and useful, but it is just awkward for this problem. >>> >>> First save the dataset if not already -save-d. Then >>> >>> gen long obsno = _n >>> gen endDate = startDate + Duration >>> expand 2 >>> bysort obsno : gen Date = cond(_n == 1, startDate, endDate) >>> bysort obsno : gen inOut = cond(_n == 1, 1, -1) >>> bysort PersonID (Date) : gen noActivities = sum(inOut) >>> >>> Compare e.g. thread starting at >>> >>> http://www.stata.com/statalist/archive/2012-11/msg01163.html >>> >>> Nick >>> >>> On Fri, Jan 25, 2013 at 7:38 PM, Constantin Alba >>> <constantin.alba@gmail.com> wrote: >>> >>>> I have a dataset organized by personID, for each personID there are >>>> startDate and Duration for various activities. >>>> >>>> I want to find out how many activities a person performs at the same >>>> time (e.g. something like workload) >>>> >>>> logically i see it like this: >>>> >>>> sort personID startDate Duration >>>> >>>> by personID: forvalues i=1/`=_N' { >>>> "compare current date with all >>>> other prior dates+duration >>>> and update counter accordingly" >>>> } >>>> >>>> however, by and forvalues do not work together, so i am trying to find >>>> an alternative way to do so. > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/faqs/resources/statalist-faq/ > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

**References**:**st: bysort, forvalues and _N***From:*Constantin Alba <constantin.alba@gmail.com>

**Re: st: bysort, forvalues and _N***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: bysort, forvalues and _N***From:*Constantin Alba <constantin.alba@gmail.com>

**Re: st: bysort, forvalues and _N***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**Re: st: filling missing values** - Next by Date:
**Re: st: advice on using Stata in an undergraduate Intro Stats course** - Previous by thread:
**Re: st: bysort, forvalues and _N** - Next by thread:
**st: RE: capture xtmixed** - Index(es):