Title | First and last occurrences in panel data | |
Author | Nicholas J. Cox, Durham University, UK | |
Date | December 2005; updated March 2007 |
I have panel data (or longitudinal data or cross-sectional time-series data). I wish to identify systematically the first (or last) occurrences of a particular condition in each panel with an indicator variable that is 1 when an observation is the first (or last) occurrence in a panel and 0 otherwise. How do I do this?
Let us be clear about what the problem is. With panel data, we have one or more panels with identifiers and a time variable. Thus a panel might look like this:
id time state 1 1 0 1 2 0 1 3 0 1 4 1 1 5 1 1 6 1 1 7 1 1 8 1 1 9 1 1 10 1 2 1 0 2 2 0 2 3 1 2 4 1 2 5 1 2 6 0 2 7 0 2 8 0 2 9 0 2 10 0
Here the variable state takes on two values, 0 and 1. In panel 1, state 1 first occurs at time 4, and the individual remains in that state. In panel 2, the individual is in state 1 only from time 3 to time 5. We need to be able to deal with both patterns, recognizing that the state concerned may be absorbing or just temporary.
This example with a binary or indicator variable is about as simple as you can imagine. We can solve this example easily, and encouragingly, we can reduce other examples to the same form. Let us take one step at a time.
A closely parallel FAQ is FAQ: How can I drop spells of missing values at the beginning and end of panel data?. You might like to read that first, particularly since it spells out some details taken for granted here.
For this problem, there is a simple Stata solution, which will be revealed in a moment. More important, however, is how you can work out the solution to these and similar problems yourself.
Two elements are immediate. First, the panel structure is crucial here. For each panel, we must identify the first (or perhaps last) occurrence of a state, say, state == 1. To experienced Stata users, this should suggest that you use by varlist:, here by id:. For more on the syntax, see by, check out sections in the manual on by:, or read the tutorial by Cox (2002).
What can seem strange at first sight is that absolutely no looping is needed here. Many Stata users, especially if they have experience using loops in other languages, tend to think about problems like this one in terms of looping over the panels and then over the times within each panel, but simpler and faster code avoids that. More precisely, code can be found that does the looping implicitly, with the details managed for you.
Second, sort order within panels is also crucial. We must work through values, respecting the order of the time variable.
Although we talk about panel data, we nowhere assume that you have declared your dataset as panel data to Stata by using tsset. That is often a good idea and does no harm here, but it is irrelevant to what follows.
Here the cumulative sum sum(state) will be 0 before the first occurrence, 1 at the first occurrence, and 1 or more thereafter.
id time state 1 1 0 1 2 0 1 3 0 1 4 1 1 5 1 1 6 1 1 7 1 1 8 1 1 9 1 1 10 1
In the first panel, sum(state) would be 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, and it is characteristic of absorbing states (those that once entered are never left) that are coded by 1 that sum(state) is 1 precisely once, on the first occurrence of the state in any panel. This leads us to the solution for absorbing states coded by 1:
. by id (time), sort: gen byte first = sum(state) == 1
However, this solution is not general enough to cope with nonabsorbing states. If, for example, state is 1 at time 4 but 0 thereafter, sum(state) will be 1 for all times from 4 onward. A more general solution is thus
. by id (time), sort: gen noccur = sum(state) . by id: gen byte first = noccur == 1 & noccur[_n - 1] != noccur
The condition noccur[_n − 1] != noccur catches those cases in which the previous value noccur[_n − 1] is 0, as well as the case when noccur equals 1 for the first observation, _n == 1, as noccur[0] is always treated as missing.
We can also do that using one variable, not two, and one statement, not two:
. by id (time), sort: gen byte first = sum(state) == 1 & sum(state[_n - 1]) == 0
This also works even when the first occurrence of state is also the first observation in the panel. Then sum(state[_n − 1]) is 0 because state[0] is evaluated as missing and sum(.) is 0. Otherwise put, the cumulative sum function sum() is hard-wired to ignore missings. More precisely, it is always initialized as 0, and it always adds 0 when fed a missing value.
Now let us rewrite this in a more long-winded way that is numerically equivalent but shows a more general solution in which, no matter what the state whose first occurrence we are seeking, we can recognize it as the first occurrence of a condition numerically evaluated as 1.
In this case, if state is 1, then state == 1 is true, and that condition is evaluated numerically as 1. Similarly, if state is 0, then state == 1 is false and that condition is evaluated numerically as 0. So, wherever we have an indicator variable, we would get the same results numerically by writing down an equivalent true-or-false condition for Stata to evaluate as 1 or 0. Conversely, to get the benefits of an indicator variable, all we need to do is write down a true-or-false condition. For more on these principles, see FAQ: What is true and false in Stata?.
. by id (time), sort: gen byte first = sum(state == 1) == 1 & sum(state[_n - 1] == 1) == 0
sum(state[_n − 1] == 1) is 0 even in the awkward case of the first observation. If _n is 1, then state[0] is evaluated as missing and is not equal to 1; thus, state[_n - 1] == 1 is false or numerically 0, so sum(0) is 0.
When was the first occasion on which the frog turned into a prince?
. by id (time), sort: gen byte first = sum(state == "prince") == 1 & sum(state[_n - 1] == "prince") == 0
When was the first occasion on which the value was at least 42?
. by id (time), sort: gen byte first = sum(inrange(value, 42,.)) == 1 & sum(inrange(value[_n - 1],42,.)) == 0
In this last case, the condition range >= 42 includes range == ., but that should usually be avoided.
To consider values at the end of each panel, we need to start at the end and work backward. By far, the easiest way to do this is just to reverse the sort order within each panel and then apply the same logic as before.
You could change the sort order this way:
. gen ntime = -time . by id (ntime), sort: whatever . drop ntime
Another way is to do it with gsort:
. gsort id -time . by id: whatever
Either way, you usually want to clean up the sort order again before other work by a plain tsset if you did do a tsset earlier, by typing
. sort id time
or by typing
. tsset id time
A different approach is to use egen, which gives a pleasantly direct solution. The first and last times at which state == 1 are given by
. by id, sort: egen firsttime = min(cond(state == 1, time, .)) . by id: egen lasttime = max(cond(state == 1, time, .))
The key to this approach is to realize that egen, min() and egen, max() can take expressions, here using the cond() function that yields either time when state == 1 or missing otherwise. We are exploiting the fact that Stata ignores missings in calculating extremes. Thus the first and last times reported for a panel will be missing only if the condition referred to, here state == 1, is never observed for that panel. Naturally any other true-or-false condition may be used in place of state == 1.
Given first and last times, indicator variables are at hand:
. gen byte first = time == firsttime . gen byte last = time == lasttime
A tacit assumption here is that time takes distinct values within each panel, which seems likely and is essential if tsset is to be applied.
If the first and last times themselves, rather than associated indicator variables, are of most interest, then this approach is doubly attractive. If somehow you had indicators and not times, then
. by id: egen firsttime = total(first * time) . by id: egen lasttime = total(last * time)
yields the times. Each total is based on one instance in which an indicator variable is 1 and other instances in which it is 0, so the result is just the first time or last time multiplied by one, plus various zeros, or simply the first or last time.