Search
   >> Home >> Resources & support >> FAQs >> Replacing missing values

This FAQ is based on questions and answers that appeared on Statalist.

How can I replace missing values with previous or following nonmissing values or within sequences?

Title   Replacing missing values
Author Nicholas J. Cox, Durham University, UK
Date August 2000; updated January 2012

1. The problems

Users often want to replace missing values by neighboring nonmissing values, particularly when observations occur in some definite order, often (but not always) a time order. Typically, this occurs when values of some variable should be identical within blocks of observations, but, for some reason, values are explicitly nonmissing within the dataset only for certain observations, most often the first. So, there is a wish to copy values within blocks of observations.

Alternatively, users often want to replace missing values in a sequence, usually in a time sequence. These problems can be solved with similar methods.

A different situation, not addressed directly in this FAQ, is when values of some time-varying variable are known only for certain observations. There is then a need for imputation or interpolation between known values. Copying the last value forward is unlikely to be a good method of interpolation unless, as just stated, it is known that values remained constant at a stated level until the next stated level. Either way, users applying the methods described here for imputation or interpolation take on the responsibility for what they do.

2. Without tsset: copying nonmissing values

Let us first look at the case where you have not tsset your data (see, for example, [TS] tsset for an explanation), but we will assume that the data have been put in the correct sort order, say, by typing

        . sort time 

If missing values occurred singly, then they could be replaced by the previous value

        . replace myvar = myvar[_n-1] if missing(myvar) 

or by the following value

        . replace myvar = myvar[_n+1] if missing(myvar) 

Here the subscript notation used is that _n always refers to any given observation, _n−1 to the previous observation and _n+1 to the following observation, given the current sort order. There is not, of course, any observation before the first, or after the last, so myvar[0] is always missing, as is myvar for any observation number that is negative or greater than the number of observations in the data. See [U] 13.7 Explicit subscripting for more about subscripting.

missing(myvar) catches both numeric missings and string missings. If myvar is numeric, you could write

        . replace myvar = myvar[_n+1] if myvar >= . 

because . < .a < .b < ... < .z are the numeric missing values. Most problems involve missing numeric values, so, from now on, examples will be for numeric variables only. However, if myvar were string,

        . replace myvar = myvar[_n+1] if myvar == "" 

would be correct syntax, not the previous command, because the empty string "" is string missing.

3. Copying previous values downwards: the cascade effect

Missing values may occur in blocks of two or more. Suppose you want to replace missings by the previous nonmissing value, whenever it occurred, so that given

 _n      myvar 

 1         42
 2          .
 3          .
 4         56 
 5         67
 6         78 

you want to replace not only myvar[2], but also myvar[3] with 42.

        . replace myvar = 42 in 2/3 

is an interactive solution, but, for larger datasets, you need a more systematic way of proceeding. To get this, it helps to know that replace always uses the current sort order: the value for observation 2 is always replaced before that for observation 3, so the replacement value for 2 may be used in calculating the replacement value for 3.

        . replace myvar = myvar[_n-1] if myvar >= . 

achieves this purpose. myvar[1] is unchanged, because myvar[1] is not missing. myvar[2] is replaced by the value of myvar[1], namely, 42, because myvar[2] is missing. But myvar[3] is replaced by the new value of myvar[2], 42, not its original value, missing (.). In this way, nonmissing values are copied in a cascade down the current sort order. Naturally, one or more missing values at the start of the data cannot be replaced in this way, as no nonmissing value precedes any of them.

What if you want to use the previous value only and do not want this cascade effect? You need to copy the variable and replace from that:

        . gen mycopy = myvar 
        . replace myvar = mycopy[_n-1] if myvar >= . 

No replacement is being made in mycopy, so there is no cascade effect. replace just looks across at mycopy and back one observation.

4. Copying following values upwards

The opposite case is replacement by following values, but, because replace respects the current sort order, this is not just the mirror image of replacement by previous values. In practice, it is easiest to reverse the series and work the other way.

        . gsort -time 
        . replace myvar = myvar[_n-1] if myvar >= .

gsort allows you to get reverse sort order; see [D] gsort. The command sort time puts highest values last, whereas gsort −time puts highest values first. It is as if you had generated a variable that was time multiplied by −1 and sorted on it, and, in fact, this is exactly what gsort does behind the scenes, although the variable is temporary and dropped after it has served its purpose.

        . replace myvar = myvar[_n+1] if myvar >= . 

does not produce a cascade effect. myvar[2] would be replaced by existing myvar[3], myvar[3] would be replaced by existing myvar[4], and so forth. At most, one of any block of missing values would be replaced. This might, of course, be exactly what you want.

Once again, nothing can be done about any missing values at the end of the series (placed at the beginning after the gsort). After replacement, you will probably want to reverse the sorting once again by

        . sort time 

5. Complications: several variables and panel structure

Two common complications are

Suppose that individuals are identified by id. There are just a few extra details to review, such as

        . by id (time), sort: replace myvar = myvar[_n-1] if myvar >= . 

or

        . gsort id -time 
        . quietly by id: replace myvar = myvar[_n-1] if myvar >= . 
        . sort id time 

The key to many data management problems with panel data lies in following sort by some computations under by:. For more information, see the sections of the manual indexed under by:.

6. With tsset

If you have tsset your data, say, by typing

        . tsset time 

then

        . replace myvar = L.myvar if myvar >= . 

has the effect of copying in cascade, whereas

        . replace myvar = F.myvar if myvar >=. 

has no such effect. The value of tsset is that it takes account of gaps in your data and (if you had declared a panel variable) of any panel structure to your data.

7. Missing values in sequences

In some datasets, time variables come with gaps, something like

 _n      year 

 1          . 
 2          .
 3       1990
 4          . 
 5          .
 6          . 
 7          .
 8       1995
 9          .
 10         . 

We can use a similar method and rely on cascading:

        . replace year = 1988 in 1 
        . replace year = year[_n-1] + 1 if missing(year) 

The difference is simply that each value is one more than the previous one. If data were once per decade, each value would be 10 more, and so forth. Again missing values at the beginning of a sequence need special surgery, as shown here. With tsset panel data use L.year + 1 rather than year[_n-1] + 1.

The Stata Blog: Not Elsewhere Classified Find us on Facebook Follow us on Twitter LinkedIn Google+ Watch us on YouTube