|
This FAQ is based on questions and answers that appeared on
Statalist.
How can I replace missing values with previous or following nonmissing
values or within sequences?
|
Title
|
|
Replacing missing values
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
August 2000; updated January 2012
|
1. The problems
Users often want to replace missing values by neighboring nonmissing values,
particularly when observations occur in some definite order, often (but not
always) a time order. Typically, this occurs when values of some variable
should be identical within blocks of observations, but, for some reason,
values are explicitly nonmissing within the dataset only for certain
observations, most often the first. So, there is a wish to copy values
within blocks of observations.
Alternatively, users often want to replace missing values in a sequence,
usually in a time sequence. These problems can be solved with similar
methods.
A different situation, not addressed directly in this FAQ, is when values of
some time-varying variable are known only for certain observations. There is
then a need for imputation or interpolation between known values. Copying
the last value forward is unlikely to be a good method of interpolation
unless, as just stated, it is known that values remained
constant at a stated level until the next stated level. Either way, users
applying the methods described here for imputation or interpolation take on
the responsibility for what they do.
2. Without tsset: copying nonmissing values
Let us first look at the case where you have not
tsset your data
(see, for example, [TS] tsset for an explanation), but we will assume
that the data have been put in the correct sort order, say, by typing
. sort time
If missing values occurred singly, then they could be replaced by the
previous value
. replace myvar = myvar[_n-1] if missing(myvar)
or by the following value
. replace myvar = myvar[_n+1] if missing(myvar)
Here the subscript notation used is that _n always refers to any
given observation, _n−1 to the previous observation and
_n+1 to the following observation, given the current sort order.
There is not, of course, any observation before the first, or after the
last, so myvar[0] is always missing, as is myvar for any
observation number that is negative or greater than the number of
observations in the data. See [U] 13.7 Explicit subscripting for more
about subscripting.
missing(myvar) catches both numeric missings and string missings. If
myvar is numeric, you could write
. replace myvar = myvar[_n+1] if myvar >= .
because . < .a < .b < ... < .z are
the numeric missing values. Most problems involve missing numeric values, so,
from now on, examples will be for numeric variables only. However, if
myvar were string,
. replace myvar = myvar[_n+1] if myvar == ""
would be correct syntax, not the previous command, because the empty string
"" is string missing.
3. Copying previous values downwards: the cascade effect
Missing values may occur in blocks of two or more. Suppose you want to
replace missings by the previous nonmissing value, whenever it occurred, so
that given
_n myvar
1 42
2 .
3 .
4 56
5 67
6 78
you want to replace not only myvar[2], but also
myvar[3] with 42.
. replace myvar = 42 in 2/3
is an interactive solution, but, for larger datasets, you need a more
systematic way of proceeding. To get this, it helps to know that
replace always uses the current sort order: the value for observation
2 is always replaced before that for observation 3, so the replacement
value for 2 may be used in calculating the replacement value for 3.
. replace myvar = myvar[_n-1] if myvar >= .
achieves this purpose. myvar[1] is unchanged, because myvar[1]
is not missing. myvar[2] is replaced by the value of myvar[1],
namely, 42, because myvar[2] is missing. But myvar[3] is
replaced by the new value of myvar[2], 42, not its original value,
missing (.). In this way, nonmissing values are copied in a cascade down
the current sort order. Naturally, one or more missing values at the start
of the data cannot be replaced in this way, as no nonmissing value precedes
any of them.
What if you want to use the previous value only and do not want this cascade
effect? You need to copy the variable and replace from that:
. gen mycopy = myvar
. replace myvar = mycopy[_n-1] if myvar >= .
No replacement is being made in mycopy, so there is no cascade
effect. replace just looks across at mycopy and back one
observation.
4. Copying following values upwards
The opposite case is replacement by following values, but, because
replace respects the current sort order, this is not just the mirror
image of replacement by previous values. In practice, it is easiest to
reverse the series and work the other way.
. gsort -time
. replace myvar = myvar[_n-1] if myvar >= .
gsort
allows you to get reverse sort order. See the online help for
gsort or [D]
sort. The command sort time puts highest values last, whereas
gsort −time puts highest values first. It is as if you had
generated a variable that was time multiplied by −1 and sorted
on it, and, in fact, this is exactly what gsort does behind the
scenes, although the variable is temporary and dropped after it has served
its purpose.
. replace myvar = myvar[_n+1] if myvar >= .
does not produce a cascade effect. myvar[2] would be replaced by
existing myvar[3], myvar[3] would be replaced by existing
myvar[4], and so forth. At most, one of any block of missing values
would be replaced. This might, of course, be exactly what you want.
Once again, nothing can be done about any missing values at the end of the
series (placed at the beginning after the gsort). After replacement,
you will probably want to reverse the sorting once again by
. sort time
5. Complications: several variables and panel structure
Two common complications are
- You want to do this with several variables: use foreach.
sort or gsort once, replace all variables using
foreach, and, if necessary, sort back again.
- You have panel data, so the appropriate replacement is a neighboring
nonmissing value for each individual in the panel.
Suppose that individuals are identified by id. There are just a few extra
details to review, such as
. by id (time), sort: replace myvar = myvar[_n-1] if myvar >= .
or
. gsort id -time
. quietly by id: replace myvar = myvar[_n-1] if myvar >= .
. sort id time
The key to many data management problems with panel data lies in following
sort by some computations under by:. For more information, see
the sections of the manual indexed under by:.
6. With tsset
If you have tsset your data, say, by typing
. tsset time
then
. replace myvar = L.myvar if myvar >= .
has the effect of copying in cascade, whereas
. replace myvar = F.myvar if myvar >=.
has no such effect. The value of tsset is that it takes account of
gaps in your data and (if you had declared a panel variable) of any panel
structure to your data.
7. Missing values in sequences
In some datasets, time variables come with gaps, something like
_n year
1 .
2 .
3 1990
4 .
5 .
6 .
7 .
8 1995
9 .
10 .
We can use a similar method and rely on cascading:
. replace year = 1988 in 1
. replace year = year[_n-1] + 1 if missing(year)
The difference is simply that each value is one more than the previous one.
If data were once per decade, each value would be 10 more, and so forth.
Again missing values at the beginning of a sequence need special surgery, as
shown here. With tsset panel data use L.year + 1 rather than
year[_n-1] + 1.
|