The following material is based on exchanges
on Statalist.
This FAQ is for users of Stata 7. It is not relevant for Stata 8, which
includes the tssmooth
command for calculating moving averages and other kinds of smooth summary.
Stata 7: How can I calculate moving averages for panel data?
Title: Stata 7: Moving averages for panel data
Authors: Nicholas J. Cox, Durham University, UK
Christopher Baum, Boston College
Date: December 2001
egen, ma() and its limitations
Stata’s most obvious command for calculating moving averages is the
ma() function of egen. Given an
expression, it creates a #-period moving average of that expression. By
default, # is taken as 3. # must be odd.
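As a minimal sketch, assuming a single time series (not a panel) whose
observations are already in time order, a five-period moving average could
be requested through the t() option (the name moveave5 is hypothetical):
. * moveave5 is a hypothetical name; t(5) asks for a 5-period average
. egen moveave5 = ma(myvar), t(5)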
However, as the on-line help indicates, egen, ma() may not be
combined with by varlist:, and, for that reason alone, it
is not applicable to panel data. In any case, it stands outside the set of
commands specifically written for time series; see help
time for details.
Alternative approaches
To calculate moving averages for panel data, there are at least
two choices. Both depend upon the dataset having been
tsset beforehand.
This is very much worth doing: not only can you save yourself
repeatedly specifying panel variable and time variable, but Stata
behaves smartly given any gaps in the data.
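A minimal sketch, assuming a panel identifier named id and a yearly time
variable named year (both names are hypothetical):
. * id and year are hypothetical variable names
. tsset id year
With a single time series rather than a panel, only the time variable is
specified, as in tsset year.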
1. Write your own definition using generate
Using time-series operators such as L. and F., give the
definition of the moving average as the argument to a generate
statement. If you do this, you are, naturally, not limited to the equally
weighted (unweighted) centered moving averages calculated by egen,
ma().
For example, equally weighted three-period moving averages would be given by
. generate moveave1 = (F1.myvar + myvar + L1.myvar) / 3
and unequal weights can easily be specified:
. generate moveave2 = (F1.myvar + 2 * myvar + L1.myvar) / 4
You can, of course, specify an expression such as log(myvar) instead
of a variable name such as myvar.
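For instance, here is a sketch of an equally weighted three-period moving
average of log(myvar). The function is applied to each term separately
because time-series operators attach to variable names rather than to
expressions (the name movelog is hypothetical):
. * movelog is a hypothetical name
. generate movelog = (log(F1.myvar) + log(myvar) + log(L1.myvar)) / 3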
One big advantage of this approach is that Stata automatically does the
right thing for panel data: leading and lagging values are worked out within
panels, just as logic dictates they should be. The most notable disadvantage
is that the command line can get rather long if the moving average involves
several terms.
Another example is a one-sided moving average based only on previous values.
This could be useful for generating an adaptive expectation of what a
variable will be based purely on information to date: what could someone
forecast for the current period based on the past four values, using a fixed
weighting scheme? (A four-period window might be especially common with
quarterly time series.)
. generate moveave3 = 0.4*L1.myvar + 0.3*L2.myvar + 0.2*L3.myvar + 0.1*L4.myvar
2. Use egen, filter() from SSC
Use the user-written egen function filter() from the
egenmore package on SSC. In Stata 7 (updated after 14 November 2001),
you can install this package by
. ssc inst egenmore
after which help egenmore points to details on filter(). The
two examples above would be rendered
. egen moveave1 = filter(myvar), coef(1 1 1) lags(-1/1) normalise
. egen moveave2 = filter(myvar), coef(1 2 1) lags(-1/1) normalise
(In this comparison the generate approach is perhaps more transparent,
but we will see an example of the opposite in a moment.) The lags are a
numlist, leads
being negative lags: in this case -1/1 expands to -1 0 1 or
lead 1, lag 0, lag 1. The coefficients, another numlist, multiply the
corresponding lagging or leading items: in this case those items are
F1.myvar, myvar and L1.myvar. The effect of the
normalise option is to scale each coefficient by the sum of the
coefficients so that coef(1 1 1) normalise is equivalent to
coefficients of 1/3 1/3 1/3 and coef(1 2 1) normalise is equivalent
to coefficients of 1/4 1/2 1/4.
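As a sketch of that equivalence, the second example could also be written
with the scaled coefficients given directly and normalise omitted:
. * 0.25 0.5 0.25 are the weights 1 2 1 divided by their sum, 4
. egen moveave2 = filter(myvar), coef(0.25 0.5 0.25) lags(-1/1)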
You must specify not only the lags but also the coefficients. Because
egen, ma() provides the equally weighted case, the main rationale for
egen, filter() is to support the unequally weighted case, for which
you must specify coefficients. It could also be said that obliging users to
specify coefficients is a little extra pressure on them to think about what
coefficients they want. The main justification for equal weights is, we
guess, simplicity, but equal weights have lousy frequency domain properties,
to mention just one consideration.
The third example above could be
. egen moveave3 = filter(myvar), coef(0.4 0.3 0.2 0.1) lags(1/4)
or
. egen moveave3 = filter(myvar), coef(4 3 2 1) lags(1/4) normalise
either of which is just about as complicated as the generate
approach. There are cases in which egen, filter() gives a simpler
formulation than generate. If you want a nine-term binomial filter,
which climatologists find useful, then
. egen binomial9 = filter(myvar), coef(1 8 28 56 70 56 28 8 1) lags(-4/4)
> normalise
looks perhaps less horrible than, and easier to get right than,
. gen binomial9 = (F4.myvar + 8 * F3.myvar + 28 * F2.myvar + 56 * F1.myvar +
> 70 * myvar + 56 * L1.myvar + 28 * L2.myvar + 8 * L3.myvar + L4.myvar) / 256
Just as with the generate approach, egen, filter() works
properly with panel data. In fact, as stated above, it depends upon the
dataset having been tsset beforehand.
A graphical tip
After calculating your moving averages, you will probably want to look at a
graph. The user-written command tsgraph is smart about tsset
datasets. Install it in an up-to-date Stata 7 by ssc inst tsgraph.
What about subsetting with if?
None of the above examples makes use of if restrictions. In fact,
egen, ma() will not allow if to be specified. Occasionally
people want to use if when calculating moving averages, but its use
is a little more complicated than usual.
What would you expect from a moving average calculated with if? Let
us identify two possibilities:
- Weak interpretation: I don't want to see any results for the excluded
observations.
- Strong interpretation: I don't even want you to use the values for the
excluded observations.
Here is a concrete example. Suppose that, as a consequence of some if
condition, observations 1-42 are included but observations 43 onward are
not. The moving average for observation 42 will nevertheless depend, among
other things, on the value for observation 43 if the average extends
backwards and forwards and is of length at least 3, and it will similarly
depend on some of the observations from 44 onward if the average is longer.
Our guess is that most people would go for the weak interpretation, but
whether or not that is correct, egen, filter() does not support
if either. You can always ignore what you don’t want or even
set unwanted values to missing afterwards by using replace.
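As a sketch of both interpretations, assume a hypothetical time variable
year and a hypothetical condition year <= 1990; the names moveweak, subset
and movestrong are likewise hypothetical. Under the weak interpretation,
calculate first and set unwanted results to missing afterwards:
. * year, the cutoff 1990 and all new variable names are hypothetical
. generate moveweak = (F1.myvar + myvar + L1.myvar) / 3
. replace moveweak = . if year > 1990
Under the strong interpretation, one possibility is to copy only the
included values to a new variable first, so that excluded values cannot
enter any average:
. generate subset = myvar if year <= 1990
. generate movestrong = (F1.subset + subset + L1.subset) / 3
Here movestrong is missing wherever any of the three values needed is
excluded or missing, so the last included period in each panel also gets a
missing result.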
A note on missing results at the ends of series
Because moving averages are functions of lags and leads, egen, ma()
produces missing where the lags and leads do not exist, at the beginning and
end of the series. An option nomiss forces the calculation of
shorter, uncentered moving averages for the tails.
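A minimal sketch of that option, again assuming a single time series in
time order (the name movetails is hypothetical):
. * movetails is a hypothetical name; default 3-period average
. egen movetails = ma(myvar), nomiss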
In contrast, neither generate nor egen, filter() does, or
allows, anything special to avoid missing results. If any of the values
needed for calculation is missing, then that result is missing. It is up to
users to decide whether and what corrective surgery is required for such
observations, presumably after looking at the dataset and considering any
underlying 'science' that can be brought to bear.
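One possible form of such surgery, offered only as a sketch and not as a
recommendation, is to fall back on a two-term average wherever exactly one
neighbor is missing (the name movefix is hypothetical):
. * movefix is a hypothetical name; start from the usual 3-term average
. generate movefix = (F1.myvar + myvar + L1.myvar) / 3
. * no lead available: average the current and previous values
. replace movefix = (myvar + L1.myvar) / 2 if movefix == . & F1.myvar == .
. * no lag available: average the current and next values
. replace movefix = (F1.myvar + myvar) / 2 if movefix == . & L1.myvar == .
Note that this applies wherever a neighbor is missing, including at gaps in
the middle of a series and not only at the ends, so whether it is
appropriate depends on the dataset. Results stay missing when myvar itself
is missing or when both neighbors are.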