Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

RE: st: RE: basic programming tips


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: RE: basic programming tips
Date   Thu, 12 Oct 2006 17:15:39 +0100

One of Scott's examples pivots on knowing the 
maximum value of a variable seen so far, which is 
the "record" to date, that is, at least 
when high values are hard to achieve. 

Assuming some wider interest in the precise 
Stata logic behind solutions, I will spell 
that out in considerable detail in this expository 
posting. If this is way too elementary to you, 
congratulations, and please pass it to a student 
or colleague who might find it useful. 

The solution is easily adapted to get the case of
the minimum seen so far. 

Assume some numeric response -y- measured in a series of 
years, -year-, and get the data in year order: 

. sort year 

If you look at a series of values, how would 
you determine the series of records without Stata? One 
simple algorithm, in a mixture of Stata and 
pseudo-code, runs like this: 

	initialise : record[1] = y[1] 

Clearly, the first value observed is the highest so far. 

	loop : 
	forvalues this = 2(1)_N { 
		if y[this] > record[this - 1] { 
			record[this] = y[this] 
		}
		else record[this] = record[this - 1] 
	} 

So we start the loop with "this" set to 2, and the 
operation becomes 

	if y[2] > record[1] { 
		record[2] = y[2]
	}
	else record[2] = record[1] 

and next time around the loop we set "this" to 3, 
and we get 

	if y[3] > record[2] { 
		record[3] = y[3]
	}
	else record[3] = record[2] 

and so on. With inequalities, here >, it is always a 
good idea to check whether the fraternal twin, here >=, 
is needed instead, but a moment's thought shows that 
if the new value is the same as the record to date, we don't 
need to change the record. (If your problem had some extra 
flavour, such as an interest in precisely when the 
record was broken or equalled, you would need to take extra care.) 

Now I have to confess that setting this up as a looping
problem is in at least one sense a red herring
(http://en.wikipedia.org/wiki/Red_herring), as you 
don't need to reach for any of Stata's looping machinery, 
-forval-, -foreach- or -while-. In fact, you definitely
should not do that. We can exploit the fact that 
-generate- and -replace- use Stata's sort order, made
explicit in 

SJ-4-4  dm0008   Stata tip 13: generate and replace use the current sort order
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  R. Newson
        Q4/04   SJ 4(4):484--485                                 (no commands)
        tips for using generate and replace, which use the
        current sort order

(A tip on Tips: Non-subscribers to the Stata Journal can purchase
a compact collation of the first 33 tips: see 
http://www.stata.com/bookstore/tips.html.) 

and in 

FAQ     . . . . . . . . . . . . . . . . . . . . . . . Replacing missing values
        2/03    How can I replace missing values with previous or
                following nonmissing values?
                http://www.stata.com/support/faqs/data/missing.html

The first thing we should do is initialise. We could do 
that by setting every value of -record- to -y[1]- 

. gen record = y[1] 

or by just setting the first value of -record- to -y[1]- 

. gen record = y[1] in 1 

Absent any instructions on what to do about -record[2]-
onwards, Stata can only shrug its shoulders as best it can
and set those all values to numeric missing (.). 

In other problems, it will matter how you start; 
we make sure that it does not matter here by over-writing
every value from the second onwards.  

You probably already know about a general way to 
refer to the previous observation in Stata, using 
the subscript _n - 1. _n stands in general for
the current observation number, and so _n - 1
for the one before. We will want to -replace- 
the values of -record- in observations 2 onward, 
so we can pencil that in. One way is 

. replace record = ... in 2/l 

where "2/l" signals observations from 2 until
the last ("l" for last). If we had indeed previously 
written 

. gen record = y[1] in 1 

we could pencil in 

. replace record = ... if record == . 

OR 

. replace record = ... if missing(record) 

A key point here is that the -replace- statement will implement 
the loop in the algorithm previously given, as -replace- 
follows the current sort order. We just need to fill 
in the dots ... in the command above. The dots will 
be a Stata translation of 

if y[this] > record[this - 1] { 
	record[this] = y[this] 
}
else record[this] = record[this - 1] 

One way to do this is through the -cond()- function. 
-cond()- lets you do different things, depending
on whether a specified condition is true or false, 
which is what we have here. There is a tutorial with
several examples at 

SJ-5-3  pr0016  . . Depending on conditions: a tutorial on the cond() function
        . . . . . . . . . . . . . . . . . . . . . . .  D. Kantor and N. J. Cox
        Q3/05   SJ 5(3):413--420                                 (no commands)
        tutorial on the cond() function

Our true or false condition is 

y[this] > record[this - 1]

and as previously mentioned "this" in Stata terms when referring to 
observations is _n, so this in Stata becomes 

y[_n] > record[_n - 1]

But -y[_n]- can just be written -y-, as Stata by default always 
works on the current observation number (or, more precisely, 
all of them in sequence). 

-cond()- works like this: 

-cond(<true-or-false-condition>, <result if true>, <result if false>)- 

so with one bound we get a solution: 

. gen record = y[1] in 1 
. replace record = cond(y > record[_n - 1], y, record[_n - 1]) in 2/l 

This looks like progress, I trust, but we are not quite there. 
We should worry about what happens with missings. If you know
that your data will never be missing, you need not worry.... 

If -y[1]- were missing, -record[1]- would be too, and the -replace- 
statement would never change our mind about -record-, as nothing
can be greater than missing, and missing values would get propagated 
down the dataset in a cascade. 

Similarly, if any -y- in observations 2 onwards were missing, 
the record would be set to missing and that value propagated downwards. 
The second problem is easier to fix: 

. replace record = cond(y > record[_n - 1] & y < ., y, record[_n - 1]) 

Hence we should only change -y- when it is non-missing. The first
problem requires extra surgery. (If you know about extended missing 
values .a through .z, know that they complicate this story a little
without changing essentials.) 

However, at this point, I am going to switch to another solution, 
which indeed may have occurred to you much earlier. It is just
tempting to look at a -cond()- solution, because -cond()- is 
a very good general do-it-yourself function in which you spell out what
you want given two possible outcomes. 

The maximum of two values (the larger) in Stata is given directly 
by -max()-. This has a property which at first sight may seem 
surprising, but needs to be grasped. Convince yourself of it by

. display max(1,.) 
1

. display max(.,.) 
. 

In brief, -max()- ignores missing values. Only when all arguments
supplied are missing does it show its own limited version of despair
and return a missing value. If this seems surprising, it is because
you have learned -- sometimes painfully -- that in Stata numeric
missing is reckoned to be arbitrarily large, or at least bigger than
any other number that Stata can hold. -max()- is an exception to
that, or perhaps it is better to say, as I just did, that -max()-
ignores missing values unless it has no alternative. Either way, 
-max()- solves, or dissolves, all our worries about missing values. 

. gen record = y[1] in 1 
. replace record = max(y, record[_n - 1]) in 2/l 

or

. gen record = y[1] in 1  
. replace record = max(y, record[_n - 1]) if missing(record) 

That's it. Running through the worries again:

If -record[1]- is missing because -y[1]- is missing, the first 
non-missing value encountered is still used as the first non-missing record 
because -max(<missing>, <non-missing>)- is always <non-missing>. 

If any -record- from observation 2 on is missing, this does not 
matter for the same reason. So a missing value in the sequence will
not contaminate the values of -record- perversely. 

Two complications are easy to discuss and dismiss. This is all very 
well, you say, but I have panel (longitudinal) data. How do I deal 
with that too? The answer is just to use -by:-, as in something like 

bysort id (year) : gen record = y[1] if _n == 1 
by id : replace record = max(y, record[_n - 1]) if missing(record)

One key feature of -by:- is that subscripts like [1] and [_n - 1] 
are interpreted _within the groups defined by -by:-_. If you 
want more on that, that is another story. See the sections of the manual
on -by:- or

SJ-2-1  pr0004  . . . . . . . . . . Speaking Stata:  How to move step by: step
        Q1/02   SJ 2(1):86-102                                   (no commands)
        explains the use of the by varlist : construct to tackle
        a variety of problems with group structure, ranging from
        simple calculations for each of several groups to more
        advanced manipulations that use the built-in _n and _N

Finally, you want records that are minima, not maxima? Use -min()-, not -max()-. 

Nick 
n.j.cox@durham.ac.uk 


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index