Title: Stata 7: Moving averages for panel data
Nicholas J. Cox, Durham University, UK
Christopher Baum, Boston College
Stata’s most obvious command for calculating moving averages is the ma() function of egen. Given an expression, it creates a #-period moving average of that expression. By default, # is taken as 3. # must be odd.
However, as the manual entry indicates, egen, ma() may not be combined with by varlist:, and, for that reason alone, it is not applicable to panel data. In any case, it stands outside the set of commands specifically written for time series; see time series for details.
To calculate moving averages for panel data, there are at least two choices. Both depend upon the dataset having been tsset beforehand. This is very much worth doing: not only can you save yourself repeatedly specifying panel variable and time variable, but Stata behaves smartly given any gaps in the data.
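As a minimal sketch of what tsset buys you, the do-file fragment below builds a hypothetical toy panel with a gap at time 3 and then takes a one-period lag. The variable names id, time, myvar, and lag1 are illustrative assumptions, not part of any real dataset, and the syntax is that of current Stata.

```stata
* Hypothetical toy panel: one panel (id 1) with no observation at time 3
clear
input id time myvar
1 1 10
1 2 12
1 4 16
1 5 18
end

* Declare the panel variable first, then the time variable
tsset id time

* L1. looks for the value one period earlier, not one observation earlier
generate lag1 = L1.myvar
list id time myvar lag1
```

Because time 3 is absent, lag1 is missing at time 4 rather than silently picking up the value from time 2; at time 5 it is 16.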
The first choice is to write your own definition with generate. Using time-series operators such as L. and F., give the definition of the moving average as the expression in a generate statement. If you do this, you are, naturally, not limited to the equally weighted (unweighted) centered moving averages calculated by egen, ma().
For example, equally weighted three-period moving averages would be given by
. generate moveave1 = (F1.myvar + myvar + L1.myvar) / 3
and some weights can easily be specified:
. generate moveave2 = (F1.myvar + 2 * myvar + L1.myvar) / 4
You can, of course, specify an expression such as log(myvar) instead of a variable name such as myvar.
One big advantage of this approach is that Stata automatically does the right thing for panel data: leading and lagging values are worked out within panels, just as logic dictates they should be. The most notable disadvantage is that the command line can get rather long if the moving average involves several terms.
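The within-panel behavior can be checked on a small made-up dataset. The sketch below assumes two hypothetical panels of three periods each; the data values and the names id and time are invented for illustration.

```stata
* Hypothetical toy data: two panels, three periods each
clear
input id time myvar
1 1 10
1 2 12
1 3 14
2 1 100
2 2 110
2 3 120
end
tsset id time

* Leads and lags never reach across the boundary between id 1 and id 2
generate moveave1 = (F1.myvar + myvar + L1.myvar) / 3
list id time myvar moveave1
```

Only the middle period of each panel has both neighbors, so moveave1 is 12 for panel 1 and 110 for panel 2 at time 2, and missing at times 1 and 3. In particular, the last value of panel 1 is not averaged with the first value of panel 2.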
Another example is a one-sided moving average based only on previous values. This could be useful for generating an adaptive expectation of what a variable will be based purely on information to date: what could someone forecast for the current period based on the past four values, using a fixed weighting scheme? (A 4-period lag might be particularly common with quarterly time series.)
. generate moveave3 = 0.4*L1.myvar + 0.3*L2.myvar + 0.2*L3.myvar + 0.1*L4.myvar
The second choice is to use the community-contributed egen function filter() from the egenmore package on SSC. In Stata 7 (updated after 14 November 2001), you can install this package by
. ssc inst egenmore
after which help egenmore points to details on filter(). The two examples above would be rendered
. egen moveave1 = filter(myvar), coef(1 1 1) lags(-1/1) normalise
. egen moveave2 = filter(myvar), coef(1 2 1) lags(-1/1) normalise
(In this comparison the generate approach is perhaps more transparent, but we will see an example of the opposite in a moment.) The lags are a numlist, leads being negative lags: in this case -1/1 expands to -1 0 1 or lead 1, lag 0, lag 1. The coefficients, another numlist, multiply the corresponding lagging or leading items: in this case those items are F1.myvar, myvar and L1.myvar. The effect of the normalise option is to scale each coefficient by the sum of the coefficients so that coef(1 1 1) normalise is equivalent to coefficients of 1/3 1/3 1/3 and coef(1 2 1) normalise is equivalent to coefficients of 1/4 1/2 1/4.
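The arithmetic behind normalise, as described above, can be sketched in plain Stata without egenmore installed. This is only an illustration of the scaling, not the code used inside filter(); the local macro names coefs and total are assumptions made for this example.

```stata
* Raw coefficients as they would be given in coef()
local coefs 1 2 1

* Sum the coefficients
local total 0
foreach c of local coefs {
    local total = `total' + `c'
}

* Each normalised weight is the raw coefficient divided by that sum
foreach c of local coefs {
    display `c' / `total'
}
```

For coef(1 2 1) the sum is 4, so the displayed weights are .25, .5, and .25, matching the 1/4 1/2 1/4 stated above.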
You must specify not only the lags but also the coefficients. Because egen, ma() provides the equally weighted case, the main rationale for egen, filter() is to support the unequally weighted case, for which you must specify coefficients. It could also be said that obliging users to specify coefficients is a little extra pressure on them to think about what coefficients they want. The main justification for equal weights is, we guess, simplicity, but equal weights have lousy frequency domain properties, to mention just one consideration.
The third example above could be
. egen moveave3 = filter(myvar), coef(0.4 0.3 0.2 0.1) lags(1/4)
or
. egen moveave3 = filter(myvar), coef(4 3 2 1) lags(1/4) normalise
either of which is just about as complicated as the generate approach. There are cases in which egen, filter() gives a simpler formulation than generate. If you want a nine-term binomial filter, which climatologists find useful, then
. egen binomial9 = filter(myvar), coef(1 8 28 56 70 56 28 8 1) lags(-4/4)
>      normalise
looks perhaps less horrible than, and easier to get right than,
. gen binomial9 = (F4.myvar + 8 * F3.myvar + 28 * F2.myvar + 56 * F1.myvar +
>      70 * myvar + 56 * L1.myvar + 28 * L2.myvar + 8 * L3.myvar + L4.myvar) / 256
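The nine coefficients are the binomial coefficients C(8, k) for k = 0, ..., 8, and the divisor 256 is their sum (2 to the power 8). As a hedged sketch, they can be generated rather than typed, using Stata's comb() function; the local macro names coefs and total are assumptions made for this example.

```stata
* Build the nine binomial coefficients C(8, k), k = 0, ..., 8, and their sum
local coefs
local total 0
forvalues k = 0/8 {
    local c = comb(8, `k')
    local coefs `coefs' `c'
    local total = `total' + `c'
}

* Show the coefficient list and the divisor
display "`coefs'"
display `total'
```

This displays 1 8 28 56 70 56 28 8 1 and 256. With egenmore installed, the macro could then be passed on as coef(`coefs') in place of the typed list.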
Just as with the generate approach, egen, filter() works properly with panel data. In fact, as stated above, it depends upon the dataset having been tsset beforehand.
After calculating your moving averages, you will probably want to look at a graph. The community-contributed command tsgraph is smart about tsset datasets. Install it in an up-to-date Stata 7 by ssc inst tsgraph.
None of the above examples makes use of if restrictions. In fact, egen, ma() will not allow if to be specified. Occasionally people want to use if when calculating moving averages, but its use is a little more complicated than usual.
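What would you expect from a moving average calculated with if? Let us identify two possibilities:
Weak interpretation: I don’t want to see any results for the excluded observations.
Strong interpretation: I don’t even want you to use the values in the excluded observations.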
Here is a concrete example. Suppose as a consequence of some if condition, observations 1-42 are included but not observations 43 on. But the moving average for 42 will depend, among other things, on the value for observation 43 if the average extends backwards and forwards and is of length at least 3, and it will similarly depend on some of the observations 44 onwards in some circumstances.
Our guess is that most people would go for the weak interpretation, but whether or not that is correct, egen, filter() does not support if either. You can always ignore what you don’t want or even set unwanted values to missing afterwards by using replace.
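Both readings can be sketched with generate and replace on made-up data. The condition time <= 3 and the names moveave_weak, myvar_in, and moveave_strong are hypothetical, chosen only for illustration. The weak reading hides results for excluded observations; a stronger reading also stops excluded values from entering the calculation at all.

```stata
* Hypothetical toy data: one panel, five periods
clear
input id time myvar
1 1 10
1 2 12
1 3 14
1 4 16
1 5 18
end
tsset id time

* Weak reading: calculate everywhere, then blank out the excluded results
generate moveave_weak = (F1.myvar + myvar + L1.myvar) / 3
replace moveave_weak = . if !(time <= 3)

* Stronger reading: blank out the excluded values first, then calculate
generate myvar_in = myvar if time <= 3
generate moveave_strong = (F1.myvar_in + myvar_in + L1.myvar_in) / 3

list time myvar moveave_weak moveave_strong
```

At time 3 the weak version is 14, because it still uses the value 16 from the excluded time 4, whereas the stronger version is missing there.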
Because moving averages are functions of lags and leads, egen, ma() produces missing where the lags and leads do not exist, at the beginning and end of the series. An option nomiss forces the calculation of shorter, uncentered moving averages for the tails.
In contrast, neither generate nor egen, filter() does, or allows, anything special to avoid missing results. If any of the values needed for calculation is missing, then that result is missing. It is up to users to decide whether and what corrective surgery is required for such observations, presumably after looking at the dataset and considering any underlying ‘science’ that can be brought to bear.
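One possible piece of such surgery, sketched here under assumptions, is to average whatever terms are available, which gives shorter averages at the tails. This is an illustrative choice only and is not claimed to reproduce what egen, ma() does with nomiss. The names sumterms, nterms, and moveave_tails are hypothetical.

```stata
* Hypothetical toy data: one panel, four periods
clear
input id time myvar
1 1 10
1 2 12
1 3 14
1 4 16
end
tsset id time

* Accumulate the sum and the count of the terms that are not missing
generate double sumterms = 0
generate nterms = 0
foreach v in L1.myvar myvar F1.myvar {
    replace sumterms = sumterms + `v' if !missing(`v')
    replace nterms = nterms + 1 if !missing(`v')
}

* Average over the available terms; stays missing if none is available
generate moveave_tails = sumterms / nterms if nterms > 0
list time myvar nterms moveave_tails
```

Here the interior results are the usual three-term averages (12 and 14), while the first and last observations use two terms each (11 and 15). Whether such shortened averages are defensible depends on the data, so the results should be inspected rather than trusted blindly.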