st: RE: How to set a year index?


From   Nick Cox <n.j.cox@durham.ac.uk>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   st: RE: How to set a year index?
Date   Fri, 20 May 2011 15:51:35 +0100

The next Speaking Stata column (Stata Journal 11(2), 2011) covers precisely this topic, and much more in the same territory. 

Following my signature are a few extracts. I don't use the year 2005 or 2500 as an example, but clearly that's secondary.

Nick 
n.j.cox@durham.ac.uk

Often in data management there is a need to compare values in a variable
with other values for different observations (rows, cases, or records in
non-Stata terminology).

An easy first example is relating values to a single reference value in
another observation: suppose, say, that you regard Texas or London or
your home area as a reference, or that you wish to scale values relative
to some base year such as 1980 or 2000.  

We will start with using some other observation as reference. Let us
read in some data: 

. sysuse uslifeexp 

This dataset contains various time series for life expectancy in the
United States. Suppose we want to relate changes to the base year 1960.
In this case, the dataset is of modest size and it is easy to find that
values for 1960 are in observation 61. So we could use subscripting to
relate values to that observation, as in 

. generate le_female_index = 100 * le_female / le_female[61] 

The advantage of this method is directness. The value desired as
reference is already a data value, so there is no need to calculate it.
Once we have found out which observation contains the value we need, we
can just indicate the observation number by a subscript, here [61].

The disadvantages of this method are a little more subtle. Suppose that
we are careful and keep a record of our calculations, but not so careful
as to add a comment explaining exactly what this calculation does. Then
there could be a minor puzzle working out some time later what it is
that we did.  Or suppose that we change the sort order of the data for
some reason. Then the new observation 61 is very likely to be a
different observation, and the same command line would yield a different
calculation, as we may or may not realise. Or suppose that we want to do
something like this in different datasets, in which there is no
predictable regularity about which observation contains the value we
want. Then the lack of generality of the method is clear. 
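
If a fixed subscript is used all the same, one inexpensive safeguard is to assert first that the observation holds the year we think it does. What follows is only a sketch, written as do-file lines, and it assumes the data are still in the order in which the dataset is shipped:

```stata
sysuse uslifeexp, clear
* stop with an error unless observation 61 really is the year 1960
assert year[61] == 1960
generate le_female_index = 100 * le_female / le_female[61]
```

If the sort order has changed, -assert- stops the do-file with an error instead of letting the wrong base value through.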

More general methods are ready to hand. If we 

. summarize le_female if year == 1960 

then the -if- condition will in this dataset identify just one
observation, and the value of -le_female- for that observation will
be accessible after -summarize- in one of r(min), r(mean), or r(max). 
Then the calculation to follow will be 

. generate le_female_index = 100 * le_female / r(mean)       

Clearly you may use r(min) or r(max), rather than r(mean), if you please, but the difference is quite immaterial, as the mean, minimum and maximum of a single value are all identical to that value. 

This method has the advantage over the first of being easier to
understand in a log file consulted some time after the event. Naturally,
adding an explanatory comment would make it even easier. 

What could go wrong? Suppose that there is no value for 1960. Then after
-summarize-, the results would be missing, and so would our new
variable be. So we would notice that problem sooner or later. 

Conversely, suppose that there are two or more observations for 1960.
That fact will be evident in the results for -summarize-, which report
the number of observations used. If we were automating calculations, it
would be a good idea to check that the number of non-missing values,
accessible after -summarize- in r(N), was equal to 1. 
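
A minimal sketch of such a check, written as do-file lines. The scalar names n1960 and base1960 are hypothetical, chosen only for illustration. The results are copied into scalars straight after -summarize-, so that no later command has a chance to overwrite r():

```stata
sysuse uslifeexp, clear
summarize le_female if year == 1960, meanonly
* copy the results at once, before any later command can overwrite r()
scalar n1960 = r(N)
scalar base1960 = r(mean)
* stop with an error unless exactly one non-missing base value was found
assert n1960 == 1
generate le_female_index = 100 * le_female / base1960
```

The -meanonly- option is used here only to suppress the output; r(N) and r(mean) are still left behind.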

Here is yet another way to do it. Like the -summarize- method just
explained, it is more elaborate than the first method, but it can be
greatly extended to more complicated and more challenging problems.
First, create an indicator tag or flag variable that is 1 for the
observation we want to copy. 

. gen byte tag = 1 if year == 1960 

Notice the detail of creating a -byte- variable to reduce storage.
This new variable -tag- will be 1 when year is 1960 and numeric
missing (.) when year is not equal to 1960. We can now  

. sort tag 

After sorting, the observation for the tagged year, 1960, will be sorted
to observation 1, because numeric missing sorts after any non-missing
value. The index now can be created by 

. gen le_female_index = 100 * le_female / le_female[1] 

You can see the advantage of this technique. It is a way of making Stata
first find and then use the value for 1960, regardless of where it
occurs in the dataset, or of whether the dataset had any particular sort
order. 
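
Put together as do-file lines, the sequence might look like the sketch below. The -assert- line and the final -sort year- are additions for illustration, not part of the steps above: the first guards against there being no, or more than one, observation for 1960, and the second puts the data back in time order, which -sort tag- does not preserve.

```stata
sysuse uslifeexp, clear
gen byte tag = 1 if year == 1960
sort tag
* stop with an error unless exactly one observation is tagged
assert tag[1] == 1 & missing(tag[2])
gen le_female_index = 100 * le_female / le_female[1]
* return to time order and remove the helper variable
sort year
drop tag
```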

Let us now consider extending the problem to panel data. We will use
another of Stata's datasets:  

. webuse grunfeld, clear 

This is a well-behaved panel dataset, in which each year in the dataset
is matched by a non-missing value for each panel and each variable. But
we show a technique that does not make that assumption. The panel covers
the period 1935 to 1954. Let's show how to scale each panel separately
by values in 1939. 

. gen byte baseyear = 1 if year == 1939 
. bysort company (baseyear) : gen invest_index = 100 * invest / invest[1] 

What is new here is that we are using -by:- to calculate separately
within the groups it defines. The last command does three things in
quick succession: 

1. It declares that operations will be done separately by -company-. 

2. It sorts first on -company- and then within -company- by the
new variable -baseyear-. As -baseyear- has values 1 or missing,
values of 1 will be sorted to the start of each panel for an individual
company. 

3. A new variable is created using -generate- and the expression
given. A key feature of using -by:- is that subscripts are
interpreted within groups, rather than within the entire dataset. Thus
the subscript [1] refers to the first observation for each -company-
in the current sort order. 
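
One caution, offered as a sketch that goes beyond the extract above: if some company had no observation for 1939, -baseyear- would be missing throughout that panel, and invest[1] would then be whichever observation happened to sort first. An -if- condition on baseyear[1] leaves the index missing for any such panel. In the Grunfeld data every company has a 1939 observation, so here the results should match those of the command above.

```stata
webuse grunfeld, clear
gen byte baseyear = 1 if year == 1939
* baseyear[1] is 1 only when the panel contains a 1939 observation;
* otherwise the index is left missing for that company
bysort company (baseyear) : gen invest_index = 100 * invest / invest[1] if baseyear[1] == 1
```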


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Barbara Engels
Sent: 20 May 2011 15:43
To: statalist@hsphsun2.harvard.edu
Subject: st: How to set a year index?

I have a time series from 1990 to 2010 and want to set the year 2005 as an index year (2500=100) so as to evaluate the other observations with reference to 2005. How can I do that with Stata commands?

