How can I drop spells of missing values at the beginning and end of panel data?
|
Title
|
|
Dropping spells of missing values
|
|
Author
|
Nicholas J. Cox, Durham University, UK
Gary Longton, Fred Hutchinson Cancer Research Center
|
|
Date
|
September 2003; updated March 2007
|
1. The problem
I have panel data (or longitudinal data or cross-sectional time-series data)
containing missing values. I wish to drop any observations at the beginning
or end of each panel containing just missing values. How do I do this?
2. Example and analysis of the problem
Let us be clear about what the problem is. With panel data, we have one or
more panels with identifiers and a time variable. So, a panel might look
like
id time response
1 1 .
1 2 .
1 3 2.3
1 4 3.14159
1 5 .
1 6 5.6
1 7 6.7
1 8 7.8
1 9 8.9
1 10 10.1
2 1 2
2 2 2.71828
2 3 5
2 4 7
2 5 11
2 6 .
2 7 17
2 8 19
2 9 .
2 10 .
Here we have missing values at the beginning of panel 1 and at
the end of panel 2. To save memory, especially with a large file, we
might want to drop spells of missing values at the beginning of each
panel, at the end, or both. Also, missing values may occur in the middle of each panel, that is, in observations not contiguous with blocks at the beginning or at the end.
Therefore, the answer is definitely not
. drop if response >= .
or
. drop if missing(response)
because either of these would drop missing values in the middle of each
panel, which would be going too far. That is what gives this problem its
twist.
For what follows, missing() takes on the value 1 for true
if its argument is missing and the value 0 for false, otherwise. It can be
abbreviated mi(). Also, !mi() negates or
reverses the mapping: !mi(.) is 0, as missing is not (not missing),
and !mi(42) is 1, as 42 is not missing.
3. Identifying the kind of solution needed
For many problems like this, there is a simple Stata solution, which we will
naturally reveal to you in a moment. More important, however, is how you
can work out the solution to these and similar problems yourselves.
Two elements are immediate. First, the panel structure is crucial here.
Separately for each panel, we must identify any spells of missing values at
the start or end of that panel. To experienced Stata users, this should
suggest doing things by varlist:, here by
id:. For more on the syntax, see
help by, check out sections in
the manual on by:, or read the tutorial in Cox (2002). What can seem strange at first
sight is that absolutely no looping is needed here. Many Stata users,
especially if they have much experience using loops in other languages,
tend to think about problems like this in terms of looping over the panels
and then over the times within each panel, but in this case and in many
others, simpler and faster code avoids all that. More precisely, code can be
found that does the looping implicitly, with all the details managed for
you.
Second, sort order within panels is also crucial. We must work through
values, respecting the order of the time variable.
Although we talk about panel data, we nowhere assume that you
have declared your dataset as panel data to Stata by using
tsset.
That is often a good idea and does no harm here, but it is irrelevant
to what follows.
4. Particular solutions: Missing values at the beginning
One idea to start on the problem of spells of missing values at the
beginning of each panel is to go through the panel and keep a running tally
of the number of missing values seen so far. In the first panel of our
example,
id time response
1 1 .
1 2 .
1 3 2.3
1 4 3.14159
1 5 .
1 6 5.6
1 7 6.7
1 8 7.8
1 9 8.9
1 10 10.1
this would be 1, 2, 2, 2, 3, 3, 3, 3, 3, 3. To get the running tally of the
number of missing values seen so far, we calculate sum(mi(response)).
This leads us to the solution
. by id (time), sort: drop if _n == sum(mi(response))
That appeared like a rabbit out of a hat, so let's subtract the mystery. We
want to do things by panel
. by id:
and within-panel (within id) observations must be sorted on the time
variable time, which can all be telescoped to
. by id (time), sort:
so we just need to understand drop if _n == sum(mi(response)).
We identified each observation within a panel by using _n. There is
a surprise, perhaps: why didn't we use time, which may well seem the
obvious and natural identifier? For one good reason: although in this
example the time variable time is nicely behaved, going from 1
to 10 with no gaps, there is no guarantee that this is true in general. For
general code, we use _n. Crucially, under by
varlist:, _n is defined within the distinct groups
defined by varlist, meaning here the panels. Therefore _n of 1
always identifies the first observation in each panel in the current sort
order of that panel, and here that is sorted by time, which is what
by id (time), sort: ensures.
Hence, _n always increases from 1 upward within each panel. If the
panel starts with missing values, then sum(mi(response)) does the
same. As soon as we hit a nonmissing value then sum(mi(response))
drops below _n and will remain below it. Thus our criterion for
dropping values is if _n == sum(mi(response)).
We can test this by considering the opposite case. If the panel starts with
a nonmissing value, then sum(mi(response)) starts at 0, namely, below
_n, which is 1, and it can never equal _n again within that
panel.
Another way of using the same idea is to phrase matters in terms of
sum(!mi(response)). That starts at 0 if the first value of
response is missing (because it is not [not missing]) and remains 0
so long as values are missing. It starts at 1 if the first value of
response is indeed not missing. So, the criterion is
. by id (time), sort: drop if sum(!mi(response)) == 0
We prefer this formulation, even though it may seem a little backward to
phrase the problem in terms of nonmissing values.
5. Particular solutions: Missing values at the end
Missing values at the beginning of each panel were tallied in the order of
the dataset as a whole, from first observation to last observation; in
particular, this is the way cumulative sums are calculated using
sum(). More generally, Stata always works in this direction unless
otherwise stipulated. For other applications of that principle, see the FAQ
How can I replace missing values
with previous or following nonmissing values? How can I replace missing
values within sequences?
To consider values at the end of each panel, we need to start at the
end and work backward. By far the easiest way to do that is just to reverse
the sort order within each panel, and then apply the same logic as before.
You could change the sort order this way:
. gen ntime = -time
. by id (ntime), sort: drop if sum(!mi(response)) == 0
. drop ntime
Another way is to do it with
gsort:
. gsort id -time
. by id: drop if sum(!mi(response)) == 0
Either way, you usually want to clean up the sort order again before other
work by a plain tsset if you did do a tsset earlier, by
. sort id time
or by
. tsset id time
For other uses of the reversed time device, see the FAQ mentioned earlier.
6. Particular solutions: Using egen
A different approach is to use
egen, which gives a pleasantly direct solution. The first and last
times at which response is not missing are given by
. by id, sort: egen firstnonmissing = min(cond(!missing(response), time, .))
. by id: egen lastnonmissing = max(cond(!missing(response), time, .))
The key to this approach is to realize egen, min() and egen,
max() can take expressions, here using the cond() function that
yields either time when response is or is not missing.
We are exploiting the fact that Stata ignores missings in
calculating extremes. Thus the first and last times reported for a panel
will be missing only if response is always missing for that panel,
which is what we would want anyway.
Given first and last times, observations may be dropped by one line
. drop if time < firstnonmissing | time > lastnonmissing
You might also want to drop both firstnonmissing and
lastnonmissing if they serve no further purpose. A tacit assumption
here is that time takes distinct values within each panel, which
seems likely and is in any case essential if tsset is to be applied.
In the extreme case in which all values of response are missing
within a panel, firstnonmissing will be returned as missing and all
the corresponding observations will be dropped, which is fine.
Reference
- Cox, N. J. 2002.
-
Speaking Stata: How to move step by: step
Stata Journal. 2: 86–102.
|