The following material is based on postings to
Statalist.
How can I generate a variable relating panel data to a reference panel?
|
Title
|
|
Generating variables relating panel data to a reference panel
|
|
Author
|
Nicholas J. Cox, Durham University, UK
Scott Merryman, USDA
|
|
Date
|
July 2002; updated July 2011
|
Question:
I have panel data. One panel (country, company, person, whatever) serves as
a reference panel. How do I relate values for other panels to that reference
panel, say, as a ratio (this panel’s value for this time) / (reference
panel’s value for this time)? Or, how do I work with one time (1700, 1952,
whatever) as a reference time?
Answer:
1. Identify the variable of interest and the reference
To fix ideas, we will work with a panel dataset, which may be downloaded
from the Stata website:
. use http://www.stata-press.com/data/r12/grunfeld, clear
This dataset includes economic data on 10 anonymous companies for 20 years,
1935–54. The panel structure has been specified with
xtset. The
panel variable is company, and the time
variable is year. Formally, setting the
panel structure using xtset is not essential to
what follows, but it is a good idea anyway.
The company designated 1 has highest mvalue
throughout this period, so we will use this as the reference company to
illustrate technique and will work with
mvalue for illustration.
Later, we will show how the same Stata logic can be used when you want to
work with a reference time. Substantively, that problem is different,
but the program logic is identical.
Selecting company 1s values is easy:
. generate mvalue_ref = mvalue if company == 1
The natural consequence of this command is to put missing values into the
observations of mvalue_ref for other
companies.
2. Spread the reference’s values to other observations
The next step, and the one that at first sight appears trickiest, is to
spread those values to other companies. Let us imagine the results of a
sort first on
year and then within
year on
mvalue_ref. For example, the first value
of year is 1935, and within all the values
for 1935, the 10 companies would be sorted according to
mvalue_ref. Now, at most, one value for
mvalue_ref for 1935 will be nonmissing, the
value for company 1; all others will be missing, as just mentioned. The
result of a sort on numeric values is always to place missing values
last, after any nonmissing values. In short, for each
year, any nonmissing value will always end
up as the first value. We can exploit this fact to spread the reference
panel’s values to other panels:
. bysort year (mvalue_ref): replace mvalue_ref = mvalue_ref[1]
This statement does a lot at once, which can be unpacked as follows:
- sort on year
and then within year on
mvalue_ref
- For each block of data for each year,
replace mvalue_ref with the first value it
takes in that block. (Under
by varlist:,
subscripts such as [1] are interpreted
within each block, not with reference to the dataset as a whole.)
For more on bysort,
see the manuals or the tutorial at Cox (2002).
3. Calculate the desired quantity
Now we can calculate whatever we want, say,
. generate mvalue_sc = mvalue/mvalue_ref
or the logarithm of that, or a difference.
4. What could go wrong with this approach?
This example dataset appears to be in excellent condition. There are no
missing values in the dataset, and it is complete with all companies
represented for all years. Two things often go wrong with other panel
datasets.
First, suppose that mvalue were in fact
missing for 1935 and for company 1. Then all 10 values of
mvalue_ref would be born missing for 1935 and
would remain so after sorting and then
replacement. Some missing value would get sorted
to first position within 1935 (it does not matter which, as they are all
identical), and a missing would get overwritten by a missing. This seems
fair.
Second, suppose mvalue for 1935 and company
1 were, in fact, omitted from the dataset. Then all nine values of
mvalue_ref for companies 2–10 would
be born missing for 1935, and the result would resemble that in the previous
situation, although for a different reason.
Thus the two main problems have reasonable consequences and do not affect
observations except for the year to which
they apply.
5. How do we apply this approach to a reference time?
The same Stata logic applies when we wish to work with a reference time.
Suppose for this dataset that we wish to compare values with those in 1950.
First, use the values for 1950 only:
. generate mvalue_ref = mvalue if year == 1950
Then spread that value to the other observations in each panel:
. bysort panel (mvalue_ref): replace mvalue_ref = mvalue_ref[1]
You might be tempted to do something like this:
. bysort panel (year): gen mvalue_ref = mvalue[16]
That approach hinges on noticing that the dataset in question is a balanced
panel, in which each company is observed for all years from 1935 to 1954.
Thus, when each panel is in year order, the
value for 1950 is the 16th in each panel. The solution works for this
example, and similar solutions will work whenever you have balanced panels,
but it is not good style. First, in a log file the code above may not be
transparent unless you add a comment, since the implication of using
observation 16 may well not be obvious on a later reading. Second, and more
important, this solution does not carry over to unbalanced panels.
6. An egen approach
Here is another way to do it. In some ways, it is worse technique, but in
other ways it shows more of the power of Stata.
. bysort panel: egen mvalue_ref = total(cond(year == 1950, mvalue, .))
The expression cond(year == 1950, mvalue, .) uses
the value for mvalue when
year is 1950 and missing otherwise.
When run through egen, total() under the aegis of
by panel:, the value for 1950 is
applied to every observation in each panel. egen,
total() ignores missings in its calculation, so the total is
identical to the value for 1950.
Once again, think through what might go wrong. If
mvalue were missing for 1950 in any panel,
then the total would also be missing for that panel. If there were no
observation for 1950 in any panel, then the total of all the other values in
that panel would be calculated. In both cases, missing would be returned as
panel result, as would be fair.
This approach is not equivalent to
. bysort panel: egen mvalue_ref = total(mvalue) if year == 1950
because that leaves missings almost everywhere and is absolutely no gain over
(and in fact markedly less efficient than)
. gen mvalue_ref = mvalue if year == 1950
Other examples of using egen in panel problems are
given in other FAQs:
How can I drop spells of missing values at the beginning and end of panel
data?
How can I identify first and last occurrences systematically in panel data?
In various versions before Stata 9,
egen, total() was called
egen, sum(). If you wish to learn more about
cond(), see the tutorial by Kantor and Cox
(2005).
References
- Cox, N. J. 2002.
- Speaking Stata: How to move step by: step.
Stata Journal 2: 86–102.
- Kantor, D., and N. J. Cox. 2005.
- Depending on conditions: A tutorial on the cond() function.
Stata Journal 5: 413–420.
|