[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
"Nick Cox" <[email protected]> |

To |
<[email protected]> |

Subject |
st: RE: RE: Panel data manipulation question |

Date |
Wed, 17 Jul 2002 11:59:54 +0100 |

Paul Seed > > Taggert J Brooks' problem <[email protected]> > can be approached in a different way that gets round > the missing data / incomplete panel issue. > > > I have a panel data set on individual > > countries and I > > want to create a new variable. The variable is going to be > > the ratio of a > > variable I already have in my dataset, we'll call it P. I > > want to create a > > variable we'll call X where x=p/p(i==72). That is to say I > > want to deflate > > all the values of p, by the value of p for country 72, for > > the corresponding > > year. > > Assume the data is saved in myfile.dat. > > use country year p using myfile if country == 72, clear > drop country > rename p p72 > mmerge year using myfile , type(1:n) > gen x = p/p72 > > x will be missing only if p72 == 0, p ==. or p72 == . > There are clearly different ways to solve this problem, but it _can_ be done without file manipulations, keeping all data in place. There is, as far as I can see, no missing data or incomplete panel issue, but I should have explained that more clearly in earlier postings. Scott Merryman posted a solution using an example with a well-behaved data set. His example exploited the structure of that data set and was clearly ad hoc and designed to show an approach, and not a general solution. Generalisation is easy using the white magic of by:. Scott and I are preparing a FAQ on the question, and here is part of a draft: I have panel data. One panel (country, company, person, whatever) serves as a reference panel. How do I relate variables for other panels to that reference panel, say as a ratio (this panel's value for this time) / (reference panel's value for this time)? Answer: 1. Identify the variable of interest and reference panel To fix ideas, we will work with a panel data set which may be downloaded from the Stata website: . use http://www.stata-press.com/data/r7/grunfeld.dta, clear This data set includes economic data on 10 anonymous companies for 20 years, 1935-54. The data have been -tsset- and the panel variable is -company- and the time variable is -year-. Formally setting the data as time series using -tsset- is not essential to what follows, but it is a good idea anyway. The company designated 1 has highest -mvalue- throughout this period, so we will use this as the reference company to illustrate technique, and work with -mvalue- for illustration. Selecting that company's values is easy: . gen mvalue_ref = mvalue if company == 1 Let us emphasise here that the natural consequence of that command is to put missing values into the observations of -mvalue_ref- for other companies. 2. Spread the reference panel's values to other panels The next step, and the one that at first sight appears trickiest, is to spread those values to other companies. Let us imagine the results of a -sort- first on -year- and then within -year- on -mvalue_ref-. For example, the first value of -year- is 1935, and within all the values for 1935 the ten companies would be sorted according to -mvalue_ref-. Now at most one value for -mvalue_ref- for 1935 will be non-missing, the value for company 1; all others will be missing, as just mentioned. The result of a -sort- on numeric values is always to place missing values last, after any non-missing values. In short, for each year, any non-missing value will always end up as the first value. We can exploit this fact to spread the reference panel's values to other panels: . bysort year (mvalue_ref) : replace mvalue_ref = mvalue_ref[1] This statement does a lot at once, which can be unpacked as follows: - sort on -year- and then within -year- on -mvalue_ref- - for each block of data for each year, replace -mvalue_ref- by the first value it takes in that block. (Under -by varlist:-, subscripts such as [1] are interpreted within each block, not with reference to the data set as a whole.) For more on -bysort-, see the manuals, or a tutorial at Stata Journal 2(1), 86-102 (2002). 3. Calculate the desired quantity Now we are home and dry, and can calculate whatever we want, say . gen mvalue_sc = mvalue / mvalue_ref or the logarithm of that, or a difference. 4. What could go wrong with this approach? This example data set appears to be in excellent condition. There are no missing values in the data set, and it is complete with all companies represented for all years. But two things often go wrong with other panel data sets. First, suppose that -mvalue- were in fact missing for 1935 and company 1. Then all 10 values of -mvalue_ref- would be born missing for 1935, and remain so after -sort-ing and then -replace-ment. Some missing value would get sorted to first position within 1935 (it does not matter which as they are all identical) and a missing would get overwritten by a missing. This seems fair. Second, suppose -mvalue- for 1935 and company 1 were in fact omitted from the data set. Then all 9 values of -mvalue_ref-, for companies 2 through 10, would be born missing for 1935, and the result would resemble that in the previous situation, although for a different reason. Thus the two main problems have reasonable consequences and do not affect observations except for the year to which they apply. Nick [email protected] * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: RE: Panel data manipulation question***From:*"Paul Seed" <[email protected]>

- Prev by Date:
**st: RE: Panel data manipulation question** - Next by Date:
**st: RE: Difference in correlation coefficients** - Previous by thread:
**st: RE: Panel data manipulation question** - Next by thread:
**st: stcox** - Index(es):

© Copyright 1996–2024 StataCorp LLC | Terms of use | Privacy | Contact us | What's new | Site index |