|Title||Replacing missing values|
|Author||Nicholas J. Cox, Durham University, UK|
Users often want to replace missing values by neighboring nonmissing values, particularly when observations occur in some definite order, often (but not always) a time order. Typically, this occurs when values of some variable should be identical within blocks of observations, but, for some reason, values are explicitly nonmissing within the dataset only for certain observations, most often the first. So, there is a wish to copy values within blocks of observations.
Alternatively, users often want to replace missing values in a sequence, usually in a time sequence. These problems can be solved with similar methods.
A different situation, not addressed directly in this FAQ, is when values of some time-varying variable are known only for certain observations. There is then a need for imputation or interpolation between known values. Copying the last value forward is unlikely to be a good method of interpolation unless, as just stated, it is known that values remained constant at a stated level until the next stated level. Either way, users applying the methods described here for imputation or interpolation take on the responsibility for what they do.
Note that all the interpolation methods mentioned here (and some others) are directly implemented in the community-contributed command mipolate (which is downloadable from SSC).
Let us first look at the case where you have not tsset your data (see, for example, [TS] tsset for an explanation), but we will assume that the data have been put in the correct sort order, say, by typing
. sort time
If missing values occurred singly, then they could be replaced by the previous value
. replace myvar = myvar[_n-1] if missing(myvar)
or by the following value
. replace myvar = myvar[_n+1] if missing(myvar)
Here the subscript notation used is that _n always refers to any given observation, _n−1 to the previous observation and _n+1 to the following observation, given the current sort order. There is not, of course, any observation before the first, or after the last, so myvar is always missing, as is myvar for any observation number that is negative or greater than the number of observations in the data. See [U] 13.7 Explicit subscripting for more about subscripting.
missing(myvar) catches both numeric missings and string missings. If myvar is numeric, you could write
. replace myvar = myvar[_n+1] if myvar >= .
because . < .a < .b < ... < .z are the numeric missing values. Most problems involve missing numeric values, so, from now on, examples will be for numeric variables only. However, if myvar were string,
. replace myvar = myvar[_n+1] if myvar == ""
would be correct syntax, not the previous command, because the empty string "" is string missing.
Missing values may occur in blocks of two or more. Suppose you want to replace missings by the previous nonmissing value, whenever it occurred, so that given
_n myvar 1 42 2 . 3 . 4 56 5 67 6 78
you want to replace not only myvar, but also myvar with 42.
. replace myvar = 42 in 2/3
is an interactive solution, but, for larger datasets, you need a more systematic way of proceeding. To get this, it helps to know that replace always uses the current sort order: the value for observation 2 is always replaced before that for observation 3, so the replacement value for 2 may be used in calculating the replacement value for 3.
. replace myvar = myvar[_n-1] if myvar >= .
achieves this purpose. myvar is unchanged, because myvar is not missing. myvar is replaced by the value of myvar, namely, 42, because myvar is missing. But myvar is replaced by the new value of myvar, 42, not its original value, missing (.). In this way, nonmissing values are copied in a cascade down the current sort order. Naturally, one or more missing values at the start of the data cannot be replaced in this way, as no nonmissing value precedes any of them.
What if you want to use the previous value only and do not want this cascade effect? You need to copy the variable and replace from that:
. gen mycopy = myvar . replace myvar = mycopy[_n-1] if myvar >= .
No replacement is being made in mycopy, so there is no cascade effect. replace just looks across at mycopy and back one observation.
The opposite case is replacement by following values, but, because replace respects the current sort order, this is not just the mirror image of replacement by previous values. In practice, it is easiest to reverse the series and work the other way.
. gsort -time . replace myvar = myvar[_n-1] if myvar >= .
gsort allows you to get reverse sort order; see [D] gsort. The command sort time puts highest values last, whereas gsort −time puts highest values first. It is as if you had generated a variable that was time multiplied by −1 and sorted on it, and, in fact, this is exactly what gsort does behind the scenes, although the variable is temporary and dropped after it has served its purpose.
. replace myvar = myvar[_n+1] if myvar >= .
does not produce a cascade effect. myvar would be replaced by existing myvar, myvar would be replaced by existing myvar, and so forth. At most, one of any block of missing values would be replaced. This might, of course, be exactly what you want.
Once again, nothing can be done about any missing values at the end of the series (placed at the beginning after the gsort). After replacement, you will probably want to reverse the sorting once again by
. sort time
Two common complications are
Suppose that individuals are identified by id. There are just a few extra details to review, such as
. by id (time), sort: replace myvar = myvar[_n-1] if myvar >= .
. gsort id -time . quietly by id: replace myvar = myvar[_n-1] if myvar >= . . sort id time
The key to many data management problems with panel data lies in following sort by some computations under by:. For more information, see the sections of the manual indexed under by:.
If you have tsset your data, say, by typing
. tsset time
. replace myvar = L.myvar if myvar >= .
has the effect of copying in cascade, whereas
. replace myvar = F.myvar if myvar >=.
has no such effect. The value of tsset is that it takes account of gaps in your data and (if you had declared a panel variable) of any panel structure to your data.
In some datasets, time variables come with gaps, something like
_n year 1 . 2 . 3 1990 4 . 5 . 6 . 7 . 8 1995 9 . 10 .
We can use a similar method and rely on cascading:
. replace year = 1988 in 1 . replace year = year[_n-1] + 1 if missing(year)
The difference is simply that each value is one more than the previous one. If data were once per decade, each value would be 10 more, and so forth. Again missing values at the beginning of a sequence need special surgery, as shown here. With tsset panel data use L.year + 1 rather than year[_n-1] + 1.