

From: Nick Cox <n.j.cox@durham.ac.uk>
To: "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject: st: RE: How to set a year index?
Date: Fri, 20 May 2011 15:51:35 +0100

The next Speaking Stata column (Stata Journal 11(2), 2011) covers precisely this topic, and much more in the same territory. Following my signature are a few extracts. I don't use the year 2005 or 2500 as an example, but clearly that's secondary.

Nick
n.j.cox@durham.ac.uk

Often in data management there is a need to compare values in a variable with other values for different observations (rows, cases, or records in non-Stata terminology). An easy first example is relating values to a single reference value in another observation: suppose, say, that you regard Texas or London or your home area as a reference, or that you wish to scale values relative to some base year such as 1980 or 2000.

We will start with using some other observation as reference. Let us read in some data:

. sysuse uslifeexp

This dataset contains various time series for life expectancy in the United States. Suppose we want to relate changes to the base year 1960. In this case, the dataset is of modest size and it is easy to find that values for 1960 are in observation 61. So we could use subscripting to relate values to that observation, as in

. generate le_female_index = 100 * le_female / le_female[61]

The advantage of this method is directness. The value desired as reference is already a data value, so there is no need to calculate it. Once we have found out which observation contains the value we need, we can just indicate the observation number by a subscript, here [61].

The disadvantages of this method are a little more subtle. Suppose that we are careful and keep a record of our calculations, but not so careful as to add a comment explaining exactly what this calculation does. Then there could be a minor puzzle working out some time later what it is that we did. Or suppose that we change the sort order of the data for some reason. Then the new observation 61 is very likely to be a different observation, and the same command line would yield a different calculation, as we may or may not realise. Or suppose that we want to do something like this in different datasets, in which there is no predictable regularity about which observation contains the value you want. Then the lack of generality of the method is clear.

More general methods are ready to hand. If we

. summarize le_female if year == 1960

then the -if- condition will in this dataset identify just one observation, and the value of -le_female- for that observation will be accessible after -summarize- in one of r(min), r(mean), or r(max). Then the calculation to follow will be

. generate le_female_index = 100 * le_female / r(mean)

Clearly you may use r(min) or r(max), rather than r(mean), if you please, but the difference is quite immaterial, as the mean, minimum and maximum of a single value are all identical to that value.

This method has the advantage over the first of being easier to understand in a log file consulted some time after the event. Naturally, adding an explanatory comment would make it even easier.

What could go wrong? Suppose that there is no value for 1960. Then after -summarize-, the results would be missing, and so would our new variable be. So we would notice that problem sooner or later. Conversely, suppose that there are two or more observations for 1960. That fact will be evident in the results for -summarize-. If we were automating calculations, it would be a good idea to check that the number of non-missing values, accessible after -summarize- in r(N), was equal to 1.

Here is yet another way to do it. Like the -summarize- method just explained, it is more elaborate than the first method, but it can be greatly extended to more complicated and more challenging problems.

First, create an indicator tag or flag variable that is 1 for the observation we want to copy.

. gen byte tag = 1 if year == 1960

Notice the detail of creating a -byte- variable to reduce storage. This new variable -tag- will be 1 when year is 1960 and numeric missing (.) when year is not equal to 1960. We can now

. sort tag

After sorting, the observation for the tagged year, 1960, will be sorted to observation 1, because numeric missing sorts after any non-missing value. The index now can be created by

. gen le_female_index = 100 * le_female / le_female[1]

You can see the advantage of this technique. It is a way of making Stata first find and then use the value for 1960, regardless of where it occurs in the dataset, or of whether the dataset had any particular sort order.

Let us now consider extending the problem to panel data. We will use another of Stata's datasets:

. webuse grunfeld, clear

This is a well-behaved panel dataset, in which each year in the dataset is matched by a non-missing value for each panel and each variable. But we show a technique that does not make that assumption. The panel covers the period 1935 to 1954. Let's show how to scale each panel separately by values in 1939.

. gen byte baseyear = 1 if year == 1939
. bysort company (baseyear) : gen invest_index = 100 * invest / invest[1]

What is new here is that we are using -by:- to calculate separately within the groups it defines. The last command does three things in quick succession:

1. It declares that operations will be done separately by -company-.

2. It sorts first on -company- and then within -company- by the new variable -baseyear-. As -baseyear- has values 1 or missing, values of 1 will be sorted to the start of each panel for an individual company.

3. A new variable is created using -generate- and the expression given.

A key feature of using -by:- is that subscripts are interpreted within groups, rather than within the entire dataset. Thus the subscript [1] refers to the first observation for each -company- in the current sort order.
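The extracts above suggest checking, when calculations are automated, that exactly one observation supplies the base value. What follows is only a minimal sketch of such checks, assuming the same two datasets. The use of -assert-, the scalar name base1960 and the variable name nbase are illustrative additions and are not taken from the column.

```stata
* Single series: confirm that exactly one observation matches the base year.
sysuse uslifeexp, clear
summarize le_female if year == 1960, meanonly

* Copy the results first, so that no later command can overwrite r().
* (n1960 and base1960 are hypothetical names.)
local n1960 = r(N)
scalar base1960 = r(mean)

* Stop with an error unless there is exactly one non-missing base value.
assert `n1960' == 1
generate le_female_index = 100 * le_female / base1960

* Panel data: confirm that every company has exactly one 1939 observation.
* (nbase is a hypothetical helper variable.)
webuse grunfeld, clear
egen nbase = total(year == 1939), by(company)
assert nbase == 1

* Same indexing commands as in the extract.
gen byte baseyear = 1 if year == 1939
bysort company (baseyear) : gen invest_index = 100 * invest / invest[1]
```

If either -assert- fails, Stata stops with an error before any index is computed, which is likely to be easier to spot than missing or mis-scaled values found later.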

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Barbara Engels
Sent: 20 May 2011 15:43
To: statalist@hsphsun2.harvard.edu
Subject: st: How to set a year index?

I have a time series from 1990 to 2010 and want to set the year 2005 as an index year (2500=100) so as to evaluate the other observations with reference to 2005. How can I do that with Stata commands?
