How can I identify first and last occurrences systematically in panel data?
|
Title
|
|
First and last occurrences in panel data
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
December 2005; updated March 2007
|
The problem
I have panel data (or longitudinal data or cross-sectional time-series
data). I wish to identify systematically the first (or last) occurrences of
a particular condition in each panel with an indicator variable that is 1
when an observation is the first (or last) occurrence in a panel and 0
otherwise. How do I do this?
Example and analysis of the problem
Let us be clear about what the problem is. With panel data, we have one or
more panels with identifiers and a time variable. Thus a panel might look
like this:
id time state
1 1 0
1 2 0
1 3 0
1 4 1
1 5 1
1 6 1
1 7 1
1 8 1
1 9 1
1 10 1
2 1 0
2 2 0
2 3 1
2 4 1
2 5 1
2 6 0
2 7 0
2 8 0
2 9 0
2 10 0
Here the variable state takes on two values, 0 and 1. In
panel 1, state 1 first occurs at time 4, and the individual remains in that
state. In panel 2, the individual is in state 1 only from time 3 to time 5.
We need to be able to deal with both patterns, recognizing that the state
concerned may be absorbing or just temporary.
This example with a binary or indicator variable is about as simple as you
can imagine. We can solve this example easily, and
encouragingly, we can reduce other examples to the same form. Let us
take one step at a time.
A closely parallel FAQ is
http://www.stata.com/support/faqs/data-management/dropping-spells-of-missing-values/. You might like to
read that first, particularly since it spells out some details taken for
granted here.
Identifying the kind of solution needed
For this problem, there is a simple Stata solution, which will be
revealed in a moment. More important, however, is how you can work out the
solution to these and similar problems yourself.
Two elements are immediate. First, the panel structure is crucial here.
For each panel, we must identify the first (or perhaps
last) occurrence of a state, say, state == 1. To experienced Stata
users, this should suggest that you use by varlist:,
here by id:. For more on the syntax, see help by, check out sections in
the manual on by:, or read the tutorial by Cox (2002). What can seem strange at first
sight is that absolutely no looping is needed here. Many Stata users,
especially if they have experience using loops in other languages,
tend to think about problems like this one in terms of looping over the panels
and then over the times within each panel, but
simpler and faster code avoids that. More precisely, code can be
found that does the looping implicitly, with the details managed for
you.
Second, sort order within panels is also crucial. We must work through
values, respecting the order of the time variable.
Although we talk about panel data, we nowhere assume that
you have declared your dataset as panel data to Stata by using
tsset. That is
often a good idea and does no harm here, but it is irrelevant to what
follows.
Particular solutions: First occurrences in panels
Here the cumulative sum sum(state) will be 0 before the
first occurrence, 1 at the first occurrence, and 1 or more thereafter.
id time state
1 1 0
1 2 0
1 3 0
1 4 1
1 5 1
1 6 1
1 7 1
1 8 1
1 9 1
1 10 1
In the first panel, sum(state) would be 0, 0, 0, 1, 2, 3, 4, 5, 6, 7,
and it is characteristic of absorbing states (those that once entered are
never left) that are coded by 1 that sum(state) is 1 precisely once,
on the first occurrence of the state in any panel. This leads us to the
solution for absorbing states coded by 1:
. by id (time), sort: gen byte first = sum(state) == 1
However, this solution is not general enough to cope with nonabsorbing
states. If, for example, state is 1 at time 4 but 0
thereafter, sum(state) will be 1 for all times from 4 onward. A more
general solution is thus
. by id (time), sort: gen noccur = sum(state)
. by id: gen byte first = noccur == 1 & noccur[_n - 1] != noccur
The condition noccur[_n − 1] != noccur catches those cases in
which the previous value noccur[_n − 1] is 0, as well as the case
when noccur equals 1 for the first observation, _n == 1, as
noccur[0] is always treated as missing.
We can also do that using one variable, not two, and one statement, not two:
. by id (time), sort: gen byte first = sum(state) == 1 & sum(state[_n - 1]) == 0
This also works even when the first occurrence of state is also the
first observation in the panel. Then sum(state[_n −
1]) is 0 because state[0] is evaluated as missing and
sum(.) is 0. Otherwise put, the cumulative sum function sum()
is hard-wired to ignore missings. More precisely, it is always initialized
as 0, and it always adds 0 when fed a missing value.
Now let us rewrite this in a more long-winded way that is numerically
equivalent but shows a more general solution in which, no matter what the
state whose first occurrence we are seeking, we can recognize it as the
first occurrence of a condition numerically evaluated as 1.
In this case, if state is 1, then state == 1 is true, and that
condition is evaluated numerically as 1. Similarly, if state is 0,
then state == 1 is false and that condition is evaluated numerically
as 0. So, wherever we have an indicator variable, we would get the same
results numerically by writing down an equivalent true-or-false condition
for Stata to evaluate as 1 or 0. Conversely, to get the benefits of an
indicator variable, all we need to do is write down a true-or-false
condition. For more on these principles, see
http://www.stata.com/support/faqs/data-management/true-and-false/.
. by id (time), sort: gen byte first = sum(state == 1) == 1 & sum(state[_n - 1] == 1) == 0
sum(state[_n − 1] == 1) is 0 even in the awkward case of the
first observation. If _n is 1, then state[0] is evaluated as
missing and is not equal to 1; thus, state[_n - 1] == 1 is false or
numerically 0, so sum(0) is 0.
When was the first occasion on which the frog turned into a
prince?
. by id (time), sort: gen byte first = sum(state == "prince") == 1 & sum(state[_n - 1] == "prince") == 0
When was the first occasion on which the value was at least 42?
. by id (time), sort: gen byte first = sum(inrange(value, 42,.)) == 1 & sum(inrange(value[_n - 1],42,.)) == 0
In this last case, the condition range >= 42 includes
range == ., but that should usually be avoided.
Particular solutions: Last occurrences in panels
To consider values at the end of each panel, we need to start at the
end and work backward. By far, the easiest way to do this is just to
reverse the sort order within each panel and then apply the same logic as
before.
You could change the sort order this way:
. gen ntime = -time
. by id (ntime), sort: whatever
. drop ntime
Another way is to do it with
gsort:
. gsort id -time
. by id: whatever
Either way, you usually want to clean up the sort order again before other
work by a plain tsset if you did do a tsset earlier, by typing
. sort id time
or by typing
. tsset id time
Particular solutions: Using egen
A different approach is to use
egen, which gives a pleasantly direct solution. The first and last
times at which state == 1 are given by
. by id, sort: egen firsttime = min(cond(state == 1, time, .))
. by id: egen lasttime = max(cond(state == 1, time, .))
The key to this approach is to realize that egen, min() and egen,
max() can take expressions, here using the cond() function that
yields either time when state == 1 or missing otherwise. We
are exploiting the fact that Stata ignores missings in calculating extremes.
Thus the first and last times reported for a panel will be missing only if
the condition referred to, here state == 1, is never observed for
that panel. Naturally any other true-or-false condition may be used in place
of state == 1.
Given first and last times, indicator variables are at hand:
. gen byte first = time == firsttime
. gen byte last = time == lasttime
A tacit assumption here is that time takes distinct values within
each panel, which seems likely and is essential if tsset
is to be applied.
If the first and last times themselves, rather than associated indicator
variables, are of most interest, then this approach is doubly attractive. If
somehow you had indicators and not times, then
. by id: egen firsttime = total(first * time)
. by id: egen lasttime = total(last * time)
yields the times. Each total is based on one instance in which an indicator
variable is 1 and other instances in which it is 0, so the result is just
the first time or last time multiplied by one, plus various zeros, or simply
the first or last time.
Reference
- Cox, N. J. 2002.
-
Speaking Stata: How to move step by: step.
Stata Journal 2: 86–102.
|