# RE: st: RE: basic programming tips

 From "Nick Cox" To Subject RE: st: RE: basic programming tips Date Thu, 12 Oct 2006 17:15:39 +0100

One of Scott's examples pivots on knowing the
maximum value of a variable seen so far, which is
the "record" to date, that is, at least
when high values are hard to achieve.

Assuming some wider interest in the precise
Stata logic behind solutions, I will spell
that out in considerable detail in this expository
posting. If this is way too elementary to you,
congratulations, and please pass it to a student
or colleague who might find it useful.

The solution is easily adapted to get the case of
the minimum seen so far.

Assume some numeric response -y- measured in a series of
years, -year-, and get the data in year order:

. sort year

If you look at a series of values, how would
you determine the series of records without Stata? One
simple algorithm, in a mixture of Stata and
pseudo-code, runs like this:

initialise : record[1] = y[1]

Clearly, the first value observed is the highest so far.

loop :
forvalues this = 2(1)_N {
if y[this] > record[this - 1] {
record[this] = y[this]
}
else record[this] = record[this - 1]
}

So we start the loop with "this" set to 2, and the
operation becomes

if y[2] > record[1] {
record[2] = y[2]
}
else record[2] = record[1]

and next time around the loop we set "this" to 3,
and we get

if y[3] > record[2] {
record[3] = y[3]
}
else record[3] = record[2]

and so on. With inequalities, here >, it is always a
good idea to check whether the fraternal twin, here >=,
is needed instead, but a moment's thought shows that
if the new value is the same as the record to date, we don't
flavour, such as an interest in precisely when the
record was broken or equalled, you would need to take extra care.)

Now I have to confess that setting this up as a looping
problem is in at least one sense a red herring
(http://en.wikipedia.org/wiki/Red_herring), as you
don't need to reach for any of Stata's looping machinery,
-forval-, -foreach- or -while-. In fact, you definitely
should not do that. We can exploit the fact that
-generate- and -replace- use Stata's sort order, made
explicit in

SJ-4-4  dm0008   Stata tip 13: generate and replace use the current sort order
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  R. Newson
Q4/04   SJ 4(4):484--485                                 (no commands)
tips for using generate and replace, which use the
current sort order

(A tip on Tips: Non-subscribers to the Stata Journal can purchase
a compact collation of the first 33 tips: see
http://www.stata.com/bookstore/tips.html.)

and in

FAQ     . . . . . . . . . . . . . . . . . . . . . . . Replacing missing values
2/03    How can I replace missing values with previous or
following nonmissing values?
http://www.stata.com/support/faqs/data/missing.html

The first thing we should do is initialise. We could do
that by setting every value of -record- to -y[1]-

. gen record = y[1]

or by just setting the first value of -record- to -y[1]-

. gen record = y[1] in 1

Absent any instructions on what to do about -record[2]-
onwards, Stata can only shrug its shoulders as best it can
and set those all values to numeric missing (.).

In other problems, it will matter how you start;
we make sure that it does not matter here by over-writing
every value from the second onwards.

refer to the previous observation in Stata, using
the subscript _n - 1. _n stands in general for
the current observation number, and so _n - 1
for the one before. We will want to -replace-
the values of -record- in observations 2 onward,
so we can pencil that in. One way is

. replace record = ... in 2/l

where "2/l" signals observations from 2 until
the last ("l" for last). If we had indeed previously
written

. gen record = y[1] in 1

we could pencil in

. replace record = ... if record == .

OR

. replace record = ... if missing(record)

A key point here is that the -replace- statement will implement
the loop in the algorithm previously given, as -replace-
follows the current sort order. We just need to fill
in the dots ... in the command above. The dots will
be a Stata translation of

if y[this] > record[this - 1] {
record[this] = y[this]
}
else record[this] = record[this - 1]

One way to do this is through the -cond()- function.
-cond()- lets you do different things, depending
on whether a specified condition is true or false,
which is what we have here. There is a tutorial with
several examples at

SJ-5-3  pr0016  . . Depending on conditions: a tutorial on the cond() function
. . . . . . . . . . . . . . . . . . . . . . .  D. Kantor and N. J. Cox
Q3/05   SJ 5(3):413--420                                 (no commands)
tutorial on the cond() function

Our true or false condition is

y[this] > record[this - 1]

and as previously mentioned "this" in Stata terms when referring to
observations is _n, so this in Stata becomes

y[_n] > record[_n - 1]

But -y[_n]- can just be written -y-, as Stata by default always
works on the current observation number (or, more precisely,
all of them in sequence).

-cond()- works like this:

-cond(<true-or-false-condition>, <result if true>, <result if false>)-

so with one bound we get a solution:

. gen record = y[1] in 1
. replace record = cond(y > record[_n - 1], y, record[_n - 1]) in 2/l

This looks like progress, I trust, but we are not quite there.
We should worry about what happens with missings. If you know
that your data will never be missing, you need not worry....

If -y[1]- were missing, -record[1]- would be too, and the -replace-
statement would never change our mind about -record-, as nothing
can be greater than missing, and missing values would get propagated
down the dataset in a cascade.

Similarly, if any -y- in observations 2 onwards were missing,
the record would be set to missing and that value propagated downwards.
The second problem is easier to fix:

. replace record = cond(y > record[_n - 1] & y < ., y, record[_n - 1])

Hence we should only change -y- when it is non-missing. The first
problem requires extra surgery. (If you know about extended missing
values .a through .z, know that they complicate this story a little
without changing essentials.)

However, at this point, I am going to switch to another solution,
which indeed may have occurred to you much earlier. It is just
tempting to look at a -cond()- solution, because -cond()- is
a very good general do-it-yourself function in which you spell out what
you want given two possible outcomes.

The maximum of two values (the larger) in Stata is given directly
by -max()-. This has a property which at first sight may seem
surprising, but needs to be grasped. Convince yourself of it by

. display max(1,.)
1

. display max(.,.)
.

In brief, -max()- ignores missing values. Only when all arguments
supplied are missing does it show its own limited version of despair
and return a missing value. If this seems surprising, it is because
you have learned -- sometimes painfully -- that in Stata numeric
missing is reckoned to be arbitrarily large, or at least bigger than
any other number that Stata can hold. -max()- is an exception to
that, or perhaps it is better to say, as I just did, that -max()-
ignores missing values unless it has no alternative. Either way,
-max()- solves, or dissolves, all our worries about missing values.

. gen record = y[1] in 1
. replace record = max(y, record[_n - 1]) in 2/l

or

. gen record = y[1] in 1
. replace record = max(y, record[_n - 1]) if missing(record)

That's it. Running through the worries again:

If -record[1]- is missing because -y[1]- is missing, the first
non-missing value encountered is still used as the first non-missing record
because -max(<missing>, <non-missing>)- is always <non-missing>.

If any -record- from observation 2 on is missing, this does not
matter for the same reason. So a missing value in the sequence will
not contaminate the values of -record- perversely.

Two complications are easy to discuss and dismiss. This is all very
well, you say, but I have panel (longitudinal) data. How do I deal
with that too? The answer is just to use -by:-, as in something like

bysort id (year) : gen record = y[1] if _n == 1
by id : replace record = max(y, record[_n - 1]) if missing(record)

One key feature of -by:- is that subscripts like [1] and [_n - 1]
are interpreted _within the groups defined by -by:-_. If you
want more on that, that is another story. See the sections of the manual
on -by:- or

SJ-2-1  pr0004  . . . . . . . . . . Speaking Stata:  How to move step by: step
Q1/02   SJ 2(1):86-102                                   (no commands)
explains the use of the by varlist : construct to tackle
a variety of problems with group structure, ranging from
simple calculations for each of several groups to more
advanced manipulations that use the built-in _n and _N

Finally, you want records that are minima, not maxima? Use -min()-, not -max()-.

Nick
n.j.cox@durham.ac.uk

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/