Home  /  Resources & support  /  FAQs  /  Generating variables relating panel data to a reference panel
Note: The following material is based on postings to Statalist.

How can I generate a variable relating panel data to a reference panel?

Title   Generating variables relating panel data to a reference panel
Author Nicholas J. Cox, Durham University, UK
Scott Merryman, USDA

Question:

I have panel data. One panel (country, company, person, whatever) serves as a reference panel. How do I relate values for other panels to that reference panel, say, as a ratio (this panel’s value for this time) / (reference panel’s value for this time)? Or, how do I work with one time (1700, 1952, whatever) as a reference time?

Answer:


1. Identify the variable of interest and the reference

To fix ideas, we will work with a panel dataset, which may be downloaded from the Stata website:

. use http://www.stata-press.com/data/r18/grunfeld, clear

This dataset includes economic data on 10 anonymous companies for 20 years, 1935–54. The panel structure has been specified with xtset. The panel variable is company, and the time variable is year. Formally, setting the panel structure using xtset is not essential to what follows, but it is a good idea anyway.

The company designated 1 has highest mvalue throughout this period, so we will use this as the reference company to illustrate technique and will work with mvalue for illustration.

Later, we will show how the same Stata logic can be used when you want to work with a reference time. Substantively, that problem is different, but the program logic is identical.

Selecting company 1’s values is easy:

. generate mvalue_ref = mvalue if company == 1

The natural consequence of this command is to put missing values into the observations of mvalue_ref for other companies.

2. Spread the reference’s values to other observations

The next step, and the one that at first sight appears trickiest, is to spread those values to other companies. Let us imagine the results of a sort first on year and then within year on mvalue_ref. For example, the first value of year is 1935, and within all the values for 1935, the 10 companies would be sorted according to mvalue_ref. Now, at most, one value for mvalue_ref for 1935 will be nonmissing, the value for company 1; all others will be missing, as just mentioned. The result of a sort on numeric values is always to place missing values last, after any nonmissing values. In short, for each year, any nonmissing value will always end up as the first value. We can exploit this fact to spread the reference panel’s values to other panels:

. bysort year (mvalue_ref): replace mvalue_ref = mvalue_ref[1]

This statement does a lot at once, which can be unpacked as follows:

  • sort on year and then within year on mvalue_ref
  • For each block of data for each year, replace mvalue_ref with the first value it takes in that block. (Under by varlist:, subscripts such as [1] are interpreted within each block, not with reference to the dataset as a whole.)

For more on bysort, see the manuals or the tutorial at Cox (2002).

3. Calculate the desired quantity

Now we can calculate whatever we want, say,

. generate mvalue_sc = mvalue/mvalue_ref

or the logarithm of that, or a difference.

4. What could go wrong with this approach?

This example dataset appears to be in excellent condition. There are no missing values in the dataset, and it is complete with all companies represented for all years. Two things often go wrong with other panel datasets.

First, suppose that mvalue were in fact missing for 1935 and for company 1. Then all 10 values of mvalue_ref would be born missing for 1935 and would remain so after sorting and then replacement. Some missing value would get sorted to first position within 1935 (it does not matter which, as they are all identical), and a missing would get overwritten by a missing. This seems fair.

Second, suppose mvalue for 1935 and company 1 were, in fact, omitted from the dataset. Then all nine values of mvalue_ref for companies 2–10 would be born missing for 1935, and the result would resemble that in the previous situation, although for a different reason.

Thus the two main problems have reasonable consequences and do not affect observations except for the year to which they apply.

5. How do we apply this approach to a reference time?

The same Stata logic applies when we wish to work with a reference time. Suppose for this dataset that we wish to compare values with those in 1950. First, use the values for 1950 only:

. generate mvalue_ref = mvalue if year == 1950

Then spread that value to the other observations in each panel:

. bysort company (mvalue_ref): replace mvalue_ref = mvalue_ref[1]

You might be tempted to do something like this:

. bysort company (year): gen mvalue_ref = mvalue[16]

That approach hinges on noticing that the dataset in question is a balanced panel, in which each company is observed for all years from 1935 to 1954. Thus, when each panel is in year order, the value for 1950 is the 16th in each panel. The solution works for this example, and similar solutions will work whenever you have balanced panels, but it is not good style. First, in a log file the code above may not be transparent unless you add a comment, since the implication of using observation 16 may well not be obvious on a later reading. Second, and more important, this solution does not carry over to unbalanced panels.

6. An egen approach

Here is another way to do it. In some ways, it is worse technique, but in other ways it shows more of the power of Stata.

. bysort company: egen mvalue_ref = total(cond(year == 1950, mvalue, .))

The expression cond(year == 1950, mvalue, .) uses the value for mvalue when year is 1950 and missing otherwise.

When run through egen, total() under the aegis of by panel:, the value for 1950 is applied to every observation in each panel. egen, total() ignores missings in its calculation, so the total is identical to the value for 1950.

Once again, think through what might go wrong. If mvalue were missing for 1950 in any panel, then the total would also be missing for that panel. If there were no observation for 1950 in any panel, then the total of all the other values in that panel would be calculated. In both cases, missing would be returned as panel result, as would be fair.

This approach is not equivalent to

. bysort company: egen mvalue_ref = total(mvalue) if year == 1950

because that leaves missings almost everywhere and is absolutely no gain over (and in fact markedly less efficient than)

. gen mvalue_ref = mvalue if year == 1950

Other examples of using egen in panel problems are given in other FAQs:

How can I drop spells of missing values at the beginning and end of panel data?

How can I identify first and last occurrences systematically in panel data?

In various versions before Stata 9, egen, total() was called egen, sum(). If you wish to learn more about cond(), see the tutorial by Kantor and Cox (2005). For a more general discussion of comparisons within datasets, see Cox (2011).

References

Cox, N. J. 2002.
Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.
Kantor, D., and N. J. Cox. 2005.
Depending on conditions: A tutorial on the cond() function. Stata Journal 5: 413–420.
Cox, N. J. 2011.
Speaking Stata: Compared with ... Stata Journal 11: 305–314.