Title | Dropping spells of missing values | |
Author |
Nicholas J. Cox, Durham University, UK Gary Longton, Fred Hutchinson Cancer Research Center |
I have panel data (or longitudinal data or cross-sectional time-series data) containing missing values. I wish to drop any observations at the beginning or end of each panel containing just missing values. How do I do this?
Let us be clear about what the problem is. With panel data, we have one or more panels with identifiers and a time variable. So, a panel might look like
id time response 1 1 . 1 2 . 1 3 2.3 1 4 3.14159 1 5 . 1 6 5.6 1 7 6.7 1 8 7.8 1 9 8.9 1 10 10.1 2 1 2 2 2 2.71828 2 3 5 2 4 7 2 5 11 2 6 . 2 7 17 2 8 19 2 9 . 2 10 .
Here we have missing values at the beginning of panel 1 and at the end of panel 2. To save memory, especially with a large file, we might want to drop spells of missing values at the beginning of each panel, at the end, or both. Also, missing values may occur in the middle of each panel, that is, in observations not contiguous with blocks at the beginning or at the end.
Therefore, the answer is definitely not
. drop if response >= .
or
. drop if missing(response)
because either of these would drop missing values in the middle of each panel, which would be going too far. That is what gives this problem its twist.
For what follows, missing() takes on the value 1 for true if its argument is missing and the value 0 for false, otherwise. It can be abbreviated mi(). Also, !mi() negates or reverses the mapping: !mi(.) is 0, as missing is not (not missing), and !mi(42) is 1, as 42 is not missing.
For many problems like this, there is a simple Stata solution, which we will naturally reveal to you in a moment. More important, however, is how you can work out the solution to these and similar problems yourselves.
Two elements are immediate. First, the panel structure is crucial here. Separately for each panel, we must identify any spells of missing values at the start or end of that panel. To experienced Stata users, this should suggest doing things by varlist:, here by id:. For more on the syntax, see by, check out sections in the manual on by:, or read the tutorial in Cox (2002).
What can seem strange at first sight is that absolutely no looping is needed here. Many Stata users, especially if they have much experience using loops in other languages, tend to think about problems like this in terms of looping over the panels and then over the times within each panel, but in this case and in many others, simpler and faster code avoids all that. More precisely, code can be found that does the looping implicitly, with all the details managed for you.
Second, sort order within panels is also crucial. We must work through values, respecting the order of the time variable.
Although we talk about panel data, we nowhere assume that you have declared your dataset as panel data to Stata by using tsset. That is often a good idea and does no harm here, but it is irrelevant to what follows.
One idea to start on the problem of spells of missing values at the beginning of each panel is to go through the panel and keep a running tally of the number of missing values seen so far. In the first panel of our example,
id time response 1 1 . 1 2 . 1 3 2.3 1 4 3.14159 1 5 . 1 6 5.6 1 7 6.7 1 8 7.8 1 9 8.9 1 10 10.1
this would be 1, 2, 2, 2, 3, 3, 3, 3, 3, 3. To get the running tally of the number of missing values seen so far, we calculate sum(mi(response)). This leads us to the solution
. by id (time), sort: drop if _n == sum(mi(response))
That appeared like a rabbit out of a hat, so let's subtract the mystery. We want to do things by panel
. by id:
and within-panel (within id) observations must be sorted on the time variable time, which can all be telescoped to
. by id (time), sort:
so we just need to understand drop if _n == sum(mi(response)). We identified each observation within a panel by using _n. There is a surprise, perhaps: why didn't we use time, which may well seem the obvious and natural identifier? For one good reason: although in this example the time variable time is nicely behaved, going from 1 to 10 with no gaps, there is no guarantee that this is true in general. For general code, we use _n. Crucially, under by varlist:, _n is defined within the distinct groups defined by varlist, meaning here the panels. Therefore _n of 1 always identifies the first observation in each panel in the current sort order of that panel, and here that is sorted by time, which is what by id (time), sort: ensures.
Hence, _n always increases from 1 upward within each panel. If the panel starts with missing values, then sum(mi(response)) does the same. As soon as we hit a nonmissing value then sum(mi(response)) drops below _n and will remain below it. Thus our criterion for dropping values is if _n == sum(mi(response)).
We can test this by considering the opposite case. If the panel starts with a nonmissing value, then sum(mi(response)) starts at 0, namely, below _n, which is 1, and it can never equal _n again within that panel.
Another way of using the same idea is to phrase matters in terms of sum(!mi(response)). That starts at 0 if the first value of response is missing (because it is not [not missing]) and remains 0 so long as values are missing. It starts at 1 if the first value of response is indeed not missing. So, the criterion is
. by id (time), sort: drop if sum(!mi(response)) == 0
We prefer this formulation, even though it may seem a little backward to phrase the problem in terms of nonmissing values.
Missing values at the beginning of each panel were tallied in the order of the dataset as a whole, from first observation to last observation; in particular, this is the way cumulative sums are calculated using sum(). More generally, Stata always works in this direction unless otherwise stipulated. For other applications of that principle, see the FAQ How can I replace missing values with previous or following nonmissing values? How can I replace missing values within sequences?
To consider values at the end of each panel, we need to start at the end and work backward. By far the easiest way to do that is just to reverse the sort order within each panel, and then apply the same logic as before.
You could change the sort order this way:
. gen ntime = -time . by id (ntime), sort: drop if sum(!mi(response)) == 0 . drop ntime
Another way is to do it with gsort:
. gsort id -time . by id: drop if sum(!mi(response)) == 0
Either way, you usually want to clean up the sort order again before other work by a plain tsset if you did do a tsset earlier, by
. sort id time
or by
. tsset id time
For other uses of the reversed time device, see the FAQ mentioned earlier.
A different approach is to use egen, which gives a pleasantly direct solution. The first and last times at which response is not missing are given by
. by id, sort: egen firstnonmissing = min(cond(!missing(response), time, .)) . by id: egen lastnonmissing = max(cond(!missing(response), time, .))
The key to this approach is to realize egen, min() and egen, max() can take expressions, here using the cond() function that yields either time when response is or is not missing. We are exploiting the fact that Stata ignores missings in calculating extremes. Thus the first and last times reported for a panel will be missing only if response is always missing for that panel, which is what we would want anyway.
Given first and last times, observations may be dropped by one line
. drop if time < firstnonmissing | time > lastnonmissing
You might also want to drop both firstnonmissing and lastnonmissing if they serve no further purpose. A tacit assumption here is that time takes distinct values within each panel, which seems likely and is in any case essential if tsset is to be applied.
In the extreme case in which all values of response are missing within a panel, firstnonmissing will be returned as missing and all the corresponding observations will be dropped, which is fine.