The following material is based on exchanges
on Statalist.

This FAQ is for users of Stata 7. It is not relevant for Stata 8, which includes the **tssmooth**
command for calculating moving averages and other kinds of smooth summary.

Title | Stata 7: Moving averages for panel data |

Authors | Nicholas J. Cox, Durham University, UK; Christopher Baum, Boston College |

Stata’s most obvious command for calculating moving averages is the
**ma()** function of **egen**. Given an
expression, it creates a #-period moving average of that expression. By
default, # is taken as 3. # must be odd.

However, as the manual entry indicates, **egen, ma()** may not be
combined with **by** *varlist***:**, and, for that reason alone, it
is not applicable to panel data. In any case, it stands outside the set of
commands specifically written for time series; see
**time series** for details.

To calculate moving averages for panel data, there are at least
two choices: write the definition yourself with **generate**, or use the
**egen** function **filter()**. Both depend upon the dataset having been
**tsset** beforehand.
This is very much worth doing: not only can you save yourself
repeatedly specifying panel variable and time variable, but Stata
behaves smartly given any gaps in the data.
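As a minimal sketch of what **tsset** buys you, assume a toy panel with hypothetical variables **id** (panel identifier), **year** (time variable), and **myvar**, in which panel 1 has no observation for 1992. The lag for 1993 should then come out as missing, not as the 1991 value, and the first observation of panel 2 should not borrow from panel 1.

```stata
* toy data: all variable names and values here are invented for illustration
clear
input id year myvar
1 1990 10
1 1991 12
1 1993 16
2 1990 20
2 1991 22
2 1992 24
end

* declare panel variable, then time variable
tsset id year

* L1. looks for the previous time period within the same panel
generate lag1 = L1.myvar
list id year myvar lag1
```

Here **lag1** is missing for 1993 in panel 1 because 1992 is absent, and missing for 1990 in panel 2 because lags never reach across panels.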

Using time-series operators such as **L.** and **F.**, give the
definition of the moving average as the argument to a **generate**
statement. If you do this, you are, naturally, not limited to the equally
weighted (unweighted) centered moving averages calculated by **egen,
ma()**.

For example, equally-weighted three-period moving averages would be given by

. generate moveave1 = (F1.myvar + myvar + L1.myvar) / 3

and some weights can easily be specified:

. generate moveave2 = (F1.myvar + 2 * myvar + L1.myvar) / 4

You can, of course, specify an expression such as **log(myvar)** instead
of a variable name such as **myvar**.

One big advantage of this approach is that Stata automatically does the right thing for panel data: leading and lagging values are worked out within panels, just as logic dictates they should be. The most notable disadvantage is that the command line can get rather long if the moving average involves several terms.
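To see the within-panel behavior concretely, here is a small sketch on invented data with two panels of four periods each; the variable names **id** and **t** and all the values are assumptions made purely for illustration.

```stata
* toy data: two panels, four time periods each
clear
input id t myvar
1 1 2
1 2 4
1 3 12
1 4 6
2 1 10
2 2 20
2 3 60
2 4 30
end
tsset id t

* equally weighted and 1-2-1 weighted three-period averages
generate moveave1 = (F1.myvar + myvar + L1.myvar) / 3
generate moveave2 = (F1.myvar + 2 * myvar + L1.myvar) / 4
list id t myvar moveave1 moveave2
```

The first and last observations of each panel are missing, because the lead or lag they need does not exist within that panel. In panel 1 at **t** = 2, **moveave1** is (12 + 4 + 2) / 3 = 6 and **moveave2** is (12 + 8 + 2) / 4 = 5.5.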

Another example is a one-sided moving average based only on previous values. This could be useful for generating an adaptive expectation of what a variable will be based purely on information to date: what could someone forecast for the current period based on the past four values, using a fixed weighting scheme? (A 4-period lag might be used especially often with quarterly time series.)

. generate moveave3 = 0.4*L1.myvar + 0.3*L2.myvar + 0.2*L3.myvar + 0.1*L4.myvar

Use the community-contributed **egen** function **filter()** from the
**egenmore** package on SSC. In Stata 7 (updated after 14 November 2001),
you can install this package by

. ssc inst egenmore

after which **help egenmore** points to details on **filter()**. The
two examples above would be rendered

. egen moveave1 = filter(myvar), coef(1 1 1) lags(-1/1) normalise
. egen moveave2 = filter(myvar), coef(1 2 1) lags(-1/1) normalise

(In this comparison the **generate** approach is perhaps more transparent,
but we will see an example of the opposite in a moment.) The **lags** are a
**numlist**, leads
being negative lags: in this case **-1/1** expands to **-1 0 1** or
lead 1, lag 0, lag 1. The **coef**ficients, another numlist, multiply the
corresponding lagging or leading items: in this case those items are
**F1.myvar**, **myvar** and **L1.myvar**. The effect of the
**normalise** option is to scale each coefficient by the sum of the
coefficients so that **coef(1 1 1) normalise** is equivalent to
coefficients of 1/3 1/3 1/3 and **coef(1 2 1) normalise** is equivalent
to coefficients of 1/4 1/2 1/4.

You must specify not only the lags but also the coefficients. Because
**egen, ma()** provides the equally weighted case, the main rationale for
**egen, filter()** is to support the unequally weighted case, for which
you must specify coefficients. It could also be said that obliging users to
specify coefficients is a little extra pressure on them to think about what
coefficients they want. The main justification for equal weights is, we
guess, simplicity, but equal weights have lousy frequency domain properties,
to mention just one consideration.

The third example above could be

. egen moveave3 = filter(myvar), coef(0.4 0.3 0.2 0.1) lags(1/4)

or

. egen moveave3 = filter(myvar), coef(4 3 2 1) lags(1/4) normalise

either of which is just about as complicated as the **generate**
approach. There are cases in which **egen, filter()** gives a simpler
formulation than **generate**. If you want a nine-term binomial filter,
which climatologists find useful, then

. egen binomial9 = filter(myvar), coef(1 8 28 56 70 56 28 8 1) lags(-4/4)
> normalise

looks perhaps less horrible than, and is easier to get right than,

. gen binomial9 = (F4.myvar + 8 * F3.myvar + 28 * F2.myvar + 56 * F1.myvar +
> 70 * myvar + 56 * L1.myvar + 28 * L2.myvar + 8 * L3.myvar + L4.myvar) / 256
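The nine weights are the binomial coefficients of order 8, and the divisor 256 is their sum, 2^8. A quick way to check both is sketched below, assuming the **comb()** function and the **forvalues** command are available in your Stata.

```stata
* binomial coefficients comb(8, k) for k = 0, ..., 8
forvalues k = 0/8 {
    display comb(8, `k')
}

* their sum, which is the divisor used for normalising
display comb(8,0) + comb(8,1) + comb(8,2) + comb(8,3) + comb(8,4) + comb(8,5) + comb(8,6) + comb(8,7) + comb(8,8)
display 2^8
```

Checking the weights this way is a small safeguard against typing errors in a long coefficient list.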

Just as with the **generate** approach, **egen, filter()** works
properly with panel data. In fact, as stated above, it depends upon the
dataset having been **tsset** beforehand.

After calculating your moving averages, you will probably want to look at a
graph. The community-contributed command **tsgraph** is smart about **tsset**
datasets. Install it in an up-to-date Stata 7 by **ssc inst tsgraph**.

None of the above examples make use of **if** restrictions. In fact
**egen, ma()** will not allow **if** to be specified. Occasionally
people want to use **if** when calculating moving averages, but its use
is a little more complicated than usual.

What would you expect from a moving average calculated with **if**? Let
us identify two possibilities:

- Weak interpretation: I don't want to see any results for the excluded observations.
- Strong interpretation: I don't even want you to use the values for the excluded observations.

Here is a concrete example. Suppose as a consequence of some **if**
condition, observations 1-42 are included but not observations 43 on. But
the moving average for 42 will depend, among other things, on the value for
observation 43 if the average extends backwards and forwards and is of
length at least 3, and it will similarly depend on some of the observations
44 onwards in some circumstances.

Our guess is that most people would go for the weak interpretation, but
whether or not that is correct, **egen, filter()** does not support
**if** either. You can always ignore what you don’t want, or even
set unwanted values to missing afterwards by using **replace**.
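A minimal sketch of the weak interpretation follows, using invented data and a hypothetical restriction **t <= 3**: the moving average is calculated for every observation first, and the unwanted results are then blanked out.

```stata
* toy data: a single panel with five time periods
clear
input id t myvar
1 1 2
1 2 4
1 3 12
1 4 6
1 5 8
end
tsset id t

* calculate the moving average everywhere
generate moveave1 = (F1.myvar + myvar + L1.myvar) / 3

* then discard results for the excluded observations
replace moveave1 = . if t > 3
list t myvar moveave1
```

Note that the result for **t** = 3, (6 + 12 + 4) / 3, still uses the value of **myvar** at **t** = 4, which is exactly what distinguishes the weak interpretation from the strong one.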

Because moving averages are functions of lags and leads, **egen, ma()**
produces missing where the lags and leads do not exist, at the beginning and
end of the series. An option **nomiss** forces the calculation of
shorter, uncentered moving averages for the tails.

In contrast, neither **generate** nor **egen, filter()** does, or
allows, anything special to avoid missing results. If any of the values
needed for calculation is missing, then that result is missing. It is up to
users to decide whether and what corrective surgery is required for such
observations, presumably after looking at the dataset and considering any
underlying “science” that can be brought to bear.
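If you decide that shorter averages are acceptable where values are unavailable, one possible form of surgery is sketched below with **generate** and **replace**. It is only an illustration on invented data, not a description of what **nomiss** does internally: it averages whichever of the three terms are not missing.

```stata
* toy data: a single panel with four time periods
clear
input id t myvar
1 1 2
1 2 4
1 3 12
1 4 6
end
tsset id t

* number of terms actually available for each observation
generate nterms = (F1.myvar < .) + (myvar < .) + (L1.myvar < .)

* sum of the available terms, built up one term at a time
generate total = cond(myvar < ., myvar, 0)
replace total = total + F1.myvar if F1.myvar < .
replace total = total + L1.myvar if L1.myvar < .

* average over however many terms were available
generate moveave4 = total / nterms if nterms > 0
list t myvar nterms moveave4
```

At **t** = 1 only two terms exist, so the result is (4 + 2) / 2 = 3; at **t** = 4 it is (6 + 12) / 2 = 9. Whether such uncentered averages at the tails are sensible depends on the data.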