
st: RE: re: keep longest consecutive streak (recently broken by Red Sox)


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: re: keep longest consecutive streak (recently broken by Red Sox)
Date   Sat, 15 Jan 2005 18:03:51 -0000

Kit is right that this code is wrong: 
one parenthesis was misplaced, which is 
a big deal in this case. It should be, 
I think, 

(1) gen spell = sum(L.time == .)
(2) bysort firm spell : gen length = _N
(3) bysort firm (length time) : keep if spell == spell[_N]

Sorry about that. 

Perhaps this code looks (a) bizarre enough and 
also (b) dependent enough on some very Stataish features
to deserve a longer explanation. 

(0) We're presupposing panel data that have been 
-tsset-. 

(1) L.time == . is a necessary and sufficient 
criterion for the start of a consecutive spell 
of observations (same panel, and time goes up by 
1 from one to the next), as there is no 
observation in the data for the -time- just 
before the first observation in each such spell. 
Helpfully, this applies also to the very 
first observation in each panel, so we've 
no worries about boundary cases (or spill-overs
from one panel to the next). This is 
official Stata magic, given the way that
-tsset- and operators like -L.- work. 

L.time == . 

creates 1 if true and 0 if false, 
so if you look at the results of that, 
spells for one panel now look something like this, 
inserting gaps just for emphasis, 

            1
            0
            0
            0

            1
            0
            0

            1

and the cumulative -sum()- makes that 

            1
            1
            1
            1

            2
            2
            2

            3

so that the consecutive series 
now are helpfully identified in blocks. 
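
To see the two steps on screen, here is a minimal sketch 
using a made-up panel. The firm identifier and the years are 
hypothetical, chosen only to reproduce the pattern of spells 
of length 4, 3 and 1 shown above, and -start- is an extra 
variable introduced purely for illustration. 

```stata
* Hypothetical data: one firm observed in 1990-1993,
* 1995-1997 and 1999, so there are gaps at 1994 and 1998
clear
input firm time
1 1990
1 1991
1 1992
1 1993
1 1995
1 1996
1 1997
1 1999
end
tsset firm time

* 1 at the first observation of each spell, 0 otherwise
gen byte start = L.time == .

* the running sum of those 1s labels each spell
gen spell = sum(start)

list firm time start spell, sepby(spell)
```

In the code proper the intermediate variable is skipped and 
the comparison goes straight inside -sum()-. 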

(2) The length of each spell is just
the number of observations in it. _N 
counts within groups defined by -by:- 
in this instance. 

(3) If we -sort- the longest spell
to the end of each panel, then the 
example above will get mapped to 

            3

            2
            2
            2

            1
            1
            1
            1

and our criterion for keeping 
a spell is then just that -spell- 
has the same value as the very 
last value for each panel. 

Not evident in this explanation, 
but included in the code, is 
sorting the _latest_ longest 
spell to the end if there are 
two or more spells of the same 
maximum length. 
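
As a sketch of the tie case, assume a made-up firm with 
two spells of length 2 and one of length 1 (all values 
hypothetical). Sorting on -time- after -length- within 
each firm is what puts the later of the tied spells last. 

```stata
* Hypothetical data: spells 1990-1991, 1993-1994 and 1996
clear
input firm time
1 1990
1 1991
1 1993
1 1994
1 1996
end
tsset firm time

gen spell = sum(L.time == .)
bysort firm spell : gen length = _N

* within firm, order by length and then by time, so the
* last observation belongs to the latest of the longest spells
bysort firm (length time) : keep if spell == spell[_N]

list firm time spell length
```

Only 1993 and 1994 should survive, as that spell is the 
later of the two longest. 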

(4) [not covered here] This needs some 
thought just in case there are 
observations for which -time- is missing. 
I haven't done that. 

P.S. What's the difference between this 
code and the buggy code Kit used? 

bysort firm length (time) : keep if spell == spell[_N] 

keeps the latest spell of each distinct 
length for each firm. Within "each distinct 
length", "longest" is redundant as a criterion, 
but I (and also Kit) got what we asked for. 

bysort firm (length time) : keep if spell == spell[_N] 

keeps the (latest) longest spell for each firm. 
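
The difference can be checked on the observations Kit 
lists for firm 10404 in his posting quoted below. This is 
only a sketch: the variable names are taken from Kit's 
listing, and -preserve- and -restore- are my addition so 
that both variants run on the same data. 

```stata
* Observations for firm 10404 as listed in Kit's posting
clear
input npermno year ita
10404 1987 .0511182
10404 1989 .0159272
10404 1990 .0455364
10404 1992 .0097333
10404 1993 .0231792
10404 1995 .0534575
10404 1996 .0622322
10404 2002 .0337511
10404 2003 .0296446
end
tsset npermno year

gen spell = sum(L.year == .)
bysort npermno spell : gen length = _N

* length outside the parentheses: one spell is kept for
* each distinct length within the firm
preserve
bysort npermno length (year) : keep if spell == spell[_N]
list npermno year ita
restore

* length inside the parentheses: only the latest longest
* spell is kept for the firm
bysort npermno (length year) : keep if spell == spell[_N]
list npermno year ita
```

The first -list- should show the three observations Kit 
reports (1987, 2002 and 2003); the second should show only 
2002 and 2003. 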

The following FAQ by Vince Wiggins and another 
says more on L.time == . as a criterion 
for the start of a spell of consecutive 
observations. 
http://www.stata.com/support/faqs/data/panel.html

That then goes through another solution. It 
uses -egen-, which in refined Stata circles is 
about as stylish as waltzing with muddy boots on. 

P.P.S. the reference to "Red Sox" sounds like 
an allusion to some local sporting trivium. 
The FAQ says

"Statalist is an international list. 
Please explain details that may make sense 
only in your own corner of the world." 

Nick 
[email protected] 

Kit Baum
 
> Nick said
> 
> More instructive in some ways is to
> do it from scratch, with no use of user
> add-ons. Something like
> 
> gen spell = sum(L.time == .)
> bysort firm spell : gen length = _N
> bysort firm length (time) : keep if spell == spell[_N]
> 
> Nice, except that it does not work:
> 
>         +---------------------------+
>         | npermno   year        ita |
>         |---------------------------|
>    358. |   10404   1987   .0511182 |
>    359. |   10404   2002   .0337511 |
>    360. |   10404   2003   .0296446 |
>         +---------------------------+
> 
> This firm has the original observations
> 
>         +---------------------------+
>         | npermno   year        ita |
>         |---------------------------|
>    358. |   10404   1987   .0511182 |
>    359. |   10404   1989   .0159272 |
>    360. |   10404   1990   .0455364 |
>    361. |   10404   1992   .0097333 |
>    362. |   10404   1993   .0231792 |
>    363. |   10404   1995   .0534575 |
>    364. |   10404   1996   .0622322 |
>    365. |   10404   2002   .0337511 |
>    366. |   10404   2003   .0296446 |
>         +---------------------------+
> 
> By coincidence, I have been working during the last 24 hours on an 
> ado-file that does this "keep longest streak", but does it listwise 
> for an entire variable list, as is required by some matrix software 
> (that is, we need to generate the longest streak for which NONE of 
> these variables are missing). It also deals with the case, as above, 
> where the longest streak is tied; as an earlier posting suggests, the 
> latest streak should be retained (which is what my code does). I'm 
> pretty sure that it works, but I have given it to Nick to see if he 
> can break it.



