st: -tsspell- available on SSC


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: -tsspell- available on SSC
Date   Thu, 15 Aug 2002 17:56:44 +0100

Thanks to Kit Baum, a program -tsspell- for identification of spells or
runs in time series has been posted on SSC.

The immediate stimulus for writing this program was Guillermo Cruces'
question last Friday, which pointed to the problem of identifying runs
of consecutive observations in panel data, specifically the length of
the longest run.

This posting falls into two main parts: I am going to discuss
Guillermo's specific problem, and then show how -tsspell- is a general
tool for examining one class of problems including his.

One underlying theme recurs frequently on Statalist: there's a direct
solution to the problem making use of Stata's features. However, if you
do this kind of thing a lot, you might also want a convenience program
which encapsulates some of the basic tricks in the neighbourhood.

To describe or install -tsspell-, use -ssc-. If that doesn't
work, see the first URL under my signature.
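
For reference, the two -ssc- calls would look roughly like this. This is
a minimal sketch, assuming a net-aware Stata that can reach SSC:

```stata
* show the package description without installing anything
ssc describe tsspell

* download and install -tsspell- from SSC
ssc install tsspell
```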

1. Runs of consecutive observations in panel data
=================================================

Suppose we have observations for one panel with times

1, 2, 3, 5, 6, 7, 8, 9, 11, 12

Then we have three runs of consecutive observations

1, 2, 3
5, 6, 7, 8, 9
11, 12

and the longest has length 5. To this specific problem, the most elegant
solution posted was by Vince Wiggins, which presupposes -tsset- data,
that is some prior command like

. tsset id time

specifying the panel identifier and the time variable. For panel, read
"patient", "country", "station" or whatever defines a distinct time
series.

(Guillermo was using -tsset- data.)

Let's look at and then unpack Vince's code:

. gen run = .
. by id: replace run = cond(L.run == ., 1, L.run + 1)
. by id: egen maxrun = max(run)

The main idea is to exploit the fact that if there is a gap before any
observation, as before the observations with times 5 and 11 above, then

L.varname

is missing, for any numeric variable you care to specify. It's also true
that L.varname[1] is always treated as missing: as there is no
observation before the first, Stata certainly has no idea about its
contents. (Or perhaps, Stata _uncertainly_ has no idea....)

Vince generates a new variable -run- with the rule -- implemented by a
call to the -cond()- function --

	if the previous value (L.run) is missing,
	then we start counting at 1;

	else we just continue counting.

Here we can rely upon Stata generating or replacing observations _in the
current sort order_ (and, moreover, for any use of time series operators
to work, the data must be in -tsset- order). This fact doesn't seem to
be documented explicitly, but for one other application, see the FAQ

How can I replace missing values with previous
or following nonmissing values?
http://www.stata.com/support/faqs/data/missing.html

Anyway, given times

1, 2, 3
5, 6, 7, 8, 9
11, 12

the rule -generate-s a new variable -run- with values

1, 2, 3
1, 2, 3, 4, 5
1, 2

because counting restarts after a gap.  The -by id:- in

. by id: replace run = cond(L.run == ., 1, L.run + 1)

flags that we do this separately for each panel. In fact, the -by id:-
is for our benefit rather than Stata's, as the right-hand side is
automatically calculated separately for each panel.  That is, -L.run-
means the previous value of run _for this panel_ whenever panels have
been specified. However, it is important that we specify -by id:- within

. by id: egen maxrun = max(run)

as -egen- takes no automatic account of separate panels. Clearly we
could look at other characteristics of panel lengths using -egen- or
other commands, depending on what was of interest.
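
To see the whole thing run end to end, here is a small sketch on made-up
data. Panel 1 uses the times listed above; panel 2 and its times are
purely hypothetical, added only so that there is more than one panel.
Note that the sketch creates -run- as missing first and then uses
-replace-, since -L.run- can only be evaluated once -run- exists.

```stata
* toy panel data: panel 1 as in the text, panel 2 invented for illustration
clear
input id time
1 1
1 2
1 3
1 5
1 6
1 7
1 8
1 9
1 11
1 12
2 1
2 2
2 4
end

tsset id time

* start with all missing, then count along each run in sort order
gen run = .
by id: replace run = cond(L.run == ., 1, L.run + 1)

* longest run in each panel, repeated on every observation of that panel
by id: egen maxrun = max(run)

list, sepby(id)
```

For panel 1 this should give maxrun = 5, and for the invented panel 2
maxrun = 2.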

The general case of time series with separate panels also collapses
nicely to the special case of one panel, in which we need not bother to
specify any panel identifier.

After

. tsset time

. gen run = .
. replace run = cond(L.run == ., 1, L.run + 1)

generates a new variable recording sequence in run, and

. egen maxrun = max(run)

would work fine, but would contain the same constant in every
observation, so looking at -run- directly with -summarize- is better.
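
A sketch of the single-series case, again on the times used above. The
-display- line is just one way of pulling out the maximum that
-summarize- leaves behind in r(max):

```stata
* one series, no panel identifier
clear
input time
1
2
3
5
6
7
8
9
11
12
end

tsset time

gen run = .
replace run = cond(L.run == ., 1, L.run + 1)

* the maximum of -run- is the length of the longest run
summarize run
display "longest run: " r(max)
```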

2. -tsspell-
============

A program -spell- written by Richard Goldstein and myself for
identifying distinct spells has been on SSC for some time. As far as I
can recall, it arose when, in response to a question on Statalist,
Richard and I both posted code, and we then put our programs together.
In origin it predates the introduction of -tsset- and in principle is
not tied to time series at all, a modest generality that I saw as being
good in
principle. In practice, however, I guess that almost surely all
applications of -spell- are to data which are time series. (If they
are not -tsset-, then they can be -tsset- very easily.)

However, -spell- as it stands has two main disadvantages:

a. Its name affords Kit Baum intermittent opportunities to make various
jokes comparing Harry Potter and myself. (The last such comment was a
travesty of the facts, but I am bound by a disclosure agreement from
saying more until book #5 appears.)

b. Less seriously, it doesn't actually sit well with -tsset- and use of
time series operators. It seems as if it should be compatible with them,
but it isn't, because of changes it produces to the sort order of
observations (which now I would declare to be bad style).

In short, I revisited -spell- and made it compatible with, indeed
dependent on, a prior -tsset-. At the same time, I simplified it.  My
original co-author Richard Goldstein is recast in the role of
grandparent, with all that that implies.

-tsspell- examines the data, which must be -tsset- time series, to
identify spells or runs, which are contiguous sequences defined by some
condition.  -tsspell- generates new variables:

(1) indicating distinct spells (0 for not in spell, or integers 1 up);

(2) giving sequence in spell (0 for not in spell, or integers 1 up); and

(3) indicating whether observations occur at the end of spells (0 or 1).

By default, these variables will be called _spell, _seq and _end.

If the data are panel data, all operations are automatically performed
separately within panels.

There are four ways of defining spells in -tsspell-.

First, given

. tsspell varname

a new spell starts whenever varname changes.  Strictly, the condition is

(varname != L.varname) | (_n == 1)

(The condition _n == 1 is protection against the possibility that the
first value is missing.)

Second, a new spell starts whenever some condition defining the first
observation in a spell is true. A spell ends just before a new spell
starts. Such a condition may be specified by the -fcond()- option.

An example is Guillermo's problem, in which we wish to divide time into
spells of consecutive values.  A new spell starts whenever L.varname is
missing, which as said works for the first observation as well.
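
As a sketch of how this plays out, assuming -tsspell- has been installed
from SSC, here is Guillermo's problem on a small made-up panel (the data
are purely illustrative). The sequence variable _seq then plays the role
of -run- in part 1:

```stata
* toy data with one gap, between times 2 and 4
clear
input id time
1 1
1 2
1 4
1 5
1 6
end

tsset id time

* a new spell starts wherever the previous time is not in the data
tsspell, fcond(L.time == .)

* length of the longest run in each panel
egen maxrun = max(_seq), by(id)

list id time _spell _seq _end maxrun, sepby(id)
```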

Spells started by earthquakes, eruptions, accidents, revolutions,
elections, births or other traumatic events may often be defined in this
general way.

Third, spells are defined by some condition being true for every
observation in the spell. A spell ends when that condition becomes
false. Such a condition may be specified by the -cond()- option.

Fourth, a special but useful case of the previous kind is

cond(varname > 0 & varname < .)

That is, values of varname are positive (but not missing).  Given daily
data, spells of rain are defined by there being some rainfall every day.
As a convenience, such conditions may be specified by -pcond(varname)-,
or more generally, -pcond(expression)-.

Spells are deemed to end at the last observation.

Specifying -if- and/or -in- adds extra conditions and does not override
the rule that spells consist of sequences of values.  (N.B. the
behaviour of -spell- is different in this respect.)

Missing values may be ignored by using -if- to exclude them.  They are
not ignored by default, as a convenience to users wishing to explore
patterns of missing values. Recall that numeric missing . is treated as
larger than any positive number.  Thus be careful to exclude missing
values where appropriate.
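
A small sketch of why this matters, using made-up rainfall values (the
variable names -day- and -rain- and the data are purely illustrative).
It uses only -count-, so it does not depend on -tsspell-:

```stata
* toy daily rainfall with one missing value on day 3
clear
input day rain
1 0
2 12
3 .
4 15
5 3
end

* numeric missing counts as larger than any number,
* so day 3 is included here: 3 observations
count if rain >= 10

* excluding missing explicitly: 2 observations
count if rain >= 10 & rain < .
```

The same extra clause, & rain < ., is what a condition passed to
-cond()- would need if missing days should break a spell.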

I'll close with a list of various examples of -tsspell- in action.

Who is in office:

. tsspell party

Spells are distinct jobs (panel data):

. tsspell job

Number of spells (panel data):

. egen nspells = max(_spell), by(id)

Spells of consecutive values of time:

. tsspell, f(L.time == .)

Rainfall spells:

. tsspell, p(rain)

Spells in which rainfall was at least 10 mm every day:

. tsspell, c(rain >= 10 & rain < .) e(hrend) s(hrseq)

To get information on spell lengths (# observations):

. su hrseq if hrend
. tab hrseq if hrend

Length of each spell in a new variable, non-panel and panel data:

. egen length = max(_seq), by(_spell)

. egen length = max(_seq), by(id _spell)

Duration (length in time) of each spell in a new variable, panel data:

. egen tmax = max(time), by(id _spell)
. egen tmin = min(time), by(id _spell)
. gen duration = tmax - tmin

Cumulative totals of varname:

. bysort _spell (_seq) : gen total = sum(varname) if _seq

Sums of varname:

. egen total = sum(varname), by(_spell)

Spells of growth, stability, decline:

. gen sign = sign(D.varname)
. tsspell sign

One observation per spell:

. ... if _end


Nick
n.j.cox@durham.ac.uk

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


