The following material is based on postings to
Statalist.
How do I identify runs of consecutive observations in panel data?
Title: Identifying runs of consecutive observations in panel data
Author: Nicholas J. Cox, Durham University, UK
        Vince Wiggins, StataCorp
Date: August 2002; minor revisions August 2005
Question
I have panel data with some gaps. I want to look systematically at runs of
consecutive observations, especially the length of the longest run in each
panel. How do I do this?
Answer
As so often happens, this problem has both a direct solution that uses
Stata’s built-in features and a canned convenience program that
encapsulates some of the basic tricks in the neighborhood. We will describe
both approaches.
Runs of consecutive observations in panel data
Stata’s jargon of panel data borrows one of many possible
terminologies. Depending on your field, you may prefer to think in terms of
each patient, firm, country, station, site, or whatever else it is for which
you have each separate time series. For more background, see help tsset or
[TS] tsset.
First, suppose that you have tsset your panel data by some command
like
. tsset id time
This command declares the basic structure of your data with a panel
identifier id and a time variable time to Stata’s
time-series commands. It also allows you full use of appropriate features,
including time-series operators, ensuring, in particular, that they work
properly when there are gaps in observations.
Suppose, for example, that we have observations for one panel with times
1, 2, 3, 5, 6, 7, 8, 9, 11, 12
Then we have three runs of consecutive observations
1, 2, 3
5, 6, 7, 8, 9
11, 12
and the longest has length 5. There are gaps before the observations with
times 5 and 11.
Here is a complete solution from first principles, which we will unpack in a
moment:
. gen run = .
. by id: replace run = cond(L.run == ., 1, L.run + 1)
. by id: egen maxrun = max(run)
The main idea is to exploit the fact that if there is a gap before any
observation, as before the observations with times 5 and 11 above, then
L.varname is missing for any numeric variable you care to specify. It’s also
true that L.varname[1] is always treated as missing.
Since there is no observation before the first, Stata certainly has no idea
about its contents. (Or perhaps, Stata uncertainly has no idea....)
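To see this behavior directly, here is a minimal sketch using the times from the example above. The dataset is made up for illustration, and the variable names prev and newrun are hypothetical, not part of the original solution.

```stata
* Toy data (hypothetical): one panel with the times used in the text
clear
input id time
1 1
1 2
1 3
1 5
1 6
1 7
1 8
1 9
1 11
1 12
end
tsset id time

* L.time is the previous time if that period was observed, else missing
generate prev = L.time

* 1 where a run starts (first observation, or first after a gap), else 0
generate byte newrun = missing(L.time)

list time prev newrun, separator(0)
```

With these data, prev should be missing at times 1, 5, and 11, so newrun should be 1 in exactly those observations.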
We generate a new variable run containing missing values so
that it will exist for our next step. Then we replace run
with the rule, implemented by a call to the cond() function,
- if the previous value (L.run) is missing, then we start (or
restart) counting at 1;
- else we just continue counting.
Here we can rely on Stata to generate or replace observations in the
current sort order (and, moreover, for any use of time-series operators
to work, the data must be in tsset order). See, for example,
Newson (2004) or the FAQ entitled How can I replace missing values
with previous or following nonmissing values?
(http://www.stata.com/support/faqs/data-management/replacing-missing-values/).
Anyway, given times
1, 2, 3
5, 6, 7, 8, 9
11, 12
the rule replaces the variable run with values
1, 2, 3
1, 2, 3, 4, 5
1, 2
because counting restarts after a gap. The by id: in
. by id: replace run = cond(L.run == ., 1, L.run + 1)
flags that we do this separately for each panel. In fact, the by id:
is for our benefit rather than Stata’s, as the right-hand side is
automatically calculated separately for each panel. That is, L.run
means the previous value of run for this panel whenever panels
have been specified. However, it is important that we specify by id:
within
. by id: egen maxrun = max(run)
as egen takes no automatic account of separate panels. Clearly, we
could look at other properties of run lengths using egen or other
commands, depending on what was of interest.
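As one illustration of such other properties, the sketch below counts the runs in each panel and records the length of the run to which each observation belongs. It is only one possible approach, under the assumption that run has been built as described above. The toy data and the names nruns, runid, and runlen are hypothetical.

```stata
* Toy data (hypothetical): two panels, each with one gap
clear
input id time
1 1
1 2
1 3
1 5
1 6
2 1
2 3
2 4
2 5
2 6
end
tsset id time

* position within run, as in the text
generate run = .
by id: replace run = cond(L.run == ., 1, L.run + 1)

* every run starts with run == 1, so counting those counts the runs
by id: egen nruns = total(run == 1)

* label runs 1, 2, ... within each panel
by id: generate runid = sum(run == 1)

* length of the run each observation belongs to
egen runlen = max(run), by(id runid)

list, sepby(id)
```

For panel 1, run should be 1, 2, 3, 1, 2, and for panel 2 it should be 1, 1, 2, 3, 4, giving two runs in each panel. The by() option of egen is used in the last step so that the data need not be re-sorted by id and runid.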
The general case of time series with separate panels also collapses nicely
to the special case of one panel, in which we need not bother to specify any
panel identifier. The commands
. tsset time
. gen run = .
. replace run = cond(L.run == ., 1, L.run + 1)
produce a new variable run recording the position of each observation within its run. Later,
. egen maxrun = max(run)
would work fine, but the new variable would contain the same constant in
every observation, so looking at run directly, say, with
summarize, would be better.
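A minimal sketch of the single-series case follows, again with the times from the earlier example entered as hypothetical toy data. It relies on summarize leaving the maximum in r(max).

```stata
* Toy data (hypothetical): a single time series with two gaps
clear
input time
1
2
3
5
6
7
8
9
11
12
end
tsset time

generate run = .
replace run = cond(L.run == ., 1, L.run + 1)

* meanonly suppresses the output table but still saves r(max)
summarize run, meanonly
display "longest run: " r(max)
```

For these times, the longest run has length 5.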
tsspell
A user-written program, tsspell, can solve this problem and several others
based on subdividing time series. It may be downloaded using ssc. (If
ssc does not work in your Stata, see the FAQ at
http://www.stata.com/support/faqs/resources/findit-and-ssc-commands/.)
tsspell examines the data, which must be tsset time series, to
identify spells or runs, which are contiguous sequences defined by some
condition. tsspell generates new variables:
- indicating distinct spells (0 for not in spell, or integers 1 up);
- giving sequence in spell (0 for not in spell, or integers 1 up); and
- indicating whether observations occur at the end of spells (0 or 1).
By default, these variables will be called _spell, _seq, and
_end.
If the data are panel data, all operations are automatically performed
separately within panels.
There are four ways of defining spells in tsspell.
First, given
tsspell varname
a new spell starts whenever varname changes. Strictly, the condition
is
(varname != L.varname) | (_n == 1)
Here the condition _n == 1 is protection against the possibility
that the first value is missing.
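As a small sketch of this first usage, suppose a hypothetical variable state switches between 0 and 1 over time. The data below are made up, and tsspell is assumed to have been installed from SSC.

```stata
* Run once if tsspell is not yet installed:
* ssc install tsspell

* Toy data (hypothetical): a state variable that changes twice
clear
input time state
1 0
2 0
3 1
4 1
5 1
6 0
end
tsset time

* a new spell starts whenever state differs from its previous value
tsspell state

list time state _spell _seq _end, separator(0)
```

Here state changes at times 3 and 6, so one would expect three spells, starting at times 1, 3, and 6.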
Second, a new spell starts whenever some condition defining the first
observation in a spell is true. A spell ends just before a new spell starts.
Such a condition may be specified by the fcond() option. Spells
started by earthquakes, eruptions, accidents, revolutions, elections,
births, or other traumatic events may often be defined in this general way.
The problem of runs (or spells) of consecutive observations is an example.
A new spell starts whenever L.varname is missing, which, as
said, works for the first observation as well.
. tsspell, f(L.time == .)
sets up the spells, after which maximum length of run is calculated as
before:
. by id: egen maxrun = max(_seq)
For the example above, the result is indicated by
. list time _spell _seq _end maxrun

     +--------------------------------------+
     | time   _spell   _seq   _end   maxrun |
     |--------------------------------------|
  1. |    1        1      1      0        5 |
  2. |    2        1      2      0        5 |
  3. |    3        1      3      1        5 |
  4. |    5        2      1      0        5 |
  5. |    6        2      2      0        5 |
     |--------------------------------------|
  6. |    7        2      3      0        5 |
  7. |    8        2      4      0        5 |
  8. |    9        2      5      1        5 |
  9. |   11        3      1      0        5 |
 10. |   12        3      2      1        5 |
     +--------------------------------------+
Although in this example we have results for only one panel, other panels
would be treated separately.
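To check the panel behavior, here is a hedged sketch with two made-up panels. It assumes tsspell is installed. The variable first is a hypothetical helper, created with egen's tag() function, that marks one observation per panel for a compact listing.

```stata
* Toy data (hypothetical): two panels, each with one gap
clear
input id time
1 1
1 2
1 3
1 5
1 6
2 1
2 3
2 4
2 5
2 6
end
tsset id time

* spells start at the first observation of each panel and after each gap
tsspell, fcond(L.time == .)
by id: egen maxrun = max(_seq)

* show one line per panel
egen byte first = tag(id)
list id maxrun if first, noobs
```

With these data, the longest run should be 3 for panel 1 (times 1 to 3) and 4 for panel 2 (times 3 to 6).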
Third, spells are defined by some condition being true for every observation
in the spell. A spell ends when that condition becomes false. Such a
condition may be specified by the cond() option.
Fourth, a special but useful case of the previous kind is
cond(varname > 0 & varname < .)
That is, values of varname are positive (but not missing). Given
daily data, spells of rain are defined by there being some rainfall every
day. As a convenience, such conditions may be specified by
pcond(varname), or more generally,
pcond(expression).
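The rainfall case might be sketched as follows. The daily values are invented for illustration, tsspell is assumed to be installed, and spell_len is a hypothetical variable name.

```stata
* Toy data (hypothetical): daily rainfall with dry days and one missing value
clear
input day rain
1 0
2 3.2
3 1.1
4 0
5 .
6 4.0
7 0.5
8 2.3
end
tsset day

* spells of days with positive, nonmissing rainfall
tsspell, pcond(rain)

* length of each spell, copied to every observation in that spell
egen spell_len = max(_seq), by(_spell)

list day rain _spell _seq _end spell_len, separator(0)

* one line per spell: its last day and its length
list day _spell spell_len if _end == 1, noobs
```

Days 2 to 3 and days 6 to 8 satisfy the condition, so two rain spells are expected. Observations outside any spell have _spell and _seq equal to 0.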
We will wrap up by mentioning other rules applied by tsspell:
- Spells are deemed to end at the last observation.
- Specifying if and/or in adds extra conditions and does
not override the rule that spells consist of sequences of values.
- Missing values may be ignored by using if to exclude them. They
are not ignored by default, as a convenience to users wishing to
explore patterns of missing values. Recall that numeric missing (.)
is treated as larger than any positive number. Thus be
careful to exclude missing values where appropriate.
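The last point can be illustrated with a small sketch. The data are made up, and the new variable names are hypothetical. The spell(), seq(), and end() options are used here on the assumption that they rename the generated variables as described in the tsspell help file, so that tsspell can be run twice without a name clash.

```stata
* Toy data (hypothetical): one missing value in the middle
clear
input time x
1 12
2 15
3 .
4 11
5 8
end
tsset time

* missing counts as larger than 10, so time 3 stays inside the spell
tsspell, cond(x > 10) spell(sp_all) seq(seq_all) end(end_all)

* excluding missing explicitly breaks the spell at time 3
tsspell, cond(x > 10 & x < .) spell(sp_ok) seq(seq_ok) end(end_ok)

list, separator(0)
```

Under the first condition, times 1 to 4 should form a single spell. Under the second, time 3 is excluded, leaving one spell at times 1 to 2 and another at time 4.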
For other examples applying tsspell, please see its help file.
Reference
- Newson, R. 2004. Stata tip 13: generate and replace use the current sort
order. Stata Journal 4: 484–485.