This FAQ is based on material posted on Statalist.
How do I calculate the maximum or minimum seen so far in a sequence?
|
Title
|
|
Calculating the maximum and minimum of a sequence
|
|
Author
|
Nicholas J. Cox, Durham University
|
|
Date
|
October 2006
|
The maximum value of a variable seen so far in a sequence is the
“record” to date, that is, at least when high values are hard to
achieve (think of field events such as jumping or throwing). It will become
apparent that the best solution to getting this is easily adapted to get the
case of the minimum seen so far, the record when low values are hard to
achieve (think of track events such as running).
For example, consider some numeric response y measured in a series of
years, year, and get the data in year order:
. sort year
If you look at a series of values, how would you determine the series of
records without Stata? One simple algorithm, in a mixture of Stata and
pseudocode, runs like this:
initialize: record[1] = y[1]
The first value observed is the highest so far.
loop:
forvalues this = 2(1)_N {
if y[this] > record[this − 1] {
record[this] = y[this]
}
else record[this] = record[this − 1]
}
The loop runs over the integers from 2 to the number of observations
_N. So we start the loop with this set to 2, and the
operation becomes
if y[2] > record[1] {
record[2] = y[2]
}
else record[2] = record[1]
and next time around the loop we set this to 3, and we get
if y[3] > record[2] {
record[3] = y[3]
}
else record[3] = record[2]
and so on. With inequalities, here >, it is always a good idea to check
whether the fraternal twin, here >=, is needed instead, but a
moment’s thought shows that if the new value is the same as the record
to date, we do not need to change the record. (If your problem had some
extra details, such as an interest in precisely when the record was broken or
equalled, you would need to take extra care.)
Now setting this up as a looping problem is, in at least one sense, a red
herring, as you do not need to reach for any of Stata’s looping machinery,
forvalue,
foreach, or
while. In
fact, you definitely should not do that. We can exploit the fact that
generate and
replace use
Stata’s sort order, made explicit in Newson (2004)
and in the FAQ
“How can
I replace missing values with previous or following nonmissing values?”.
The first thing we should do is initialize. We could do that by setting
every value of record to y[1]
. generate record = y[1]
or just by setting the first value of record to y[1]
. generate record = y[1] in 1
Without any instructions on what to do about record[2] onwards, Stata
can only set those values to numeric missing (.).
In other problems, it will matter how you start; we make sure that it does
not matter here by overwriting every value from the second onwards.
You probably already know about a general way to refer to the previous
observation in Stata, using the subscript _n − 1. _n
stands in general for the current observation number, and so _n −
1 for the one before. We will want to
replace the
values of record in observations 2 onward, so we can pencil that in.
One way is
. replace record = ... in 2/l
where 2/l signals observations from 2 until the last
(l for last). If we had indeed previously written
. generate record = y[1] in 1
We could pencil in
. replace record = ... if record == .
or
. replace record = ... if missing(record)
A key point here is that the replace statement will implement the
loop in the algorithm previously given, as replace follows the
current sort order. We just need to fill in the dots (...) in the
command above. The dots will be a Stata translation of
if y[this] > record[this - 1] {
record[this] = y[this]
}
else record[this] = record[this - 1]
One way to do this is through the
cond() function.
cond() lets you do different things, depending on whether a
specified condition is true or false, which is what we have here.
Kantor and Cox (2005) give a tutorial with several examples.
Our true or false condition is
y[this] > record[this - 1]
and, as previously mentioned, this is _n in Stata terms when
referring to observations. In Stata, this condition becomes
y[_n] > record[_n - 1]
But y[_n] can just be written y, as Stata by default always
works on the current observation number (or, more precisely, all of them in
sequence).
cond() works like this:
cond(true-or-false-condition, result if true, result if false)
With one bound we get a solution:
. generate record = y[1] in 1
. replace record = cond(y > record[_n - 1], y, record[_n - 1]) in 2/l
This looks like progress, but we are not quite there. We should worry
about what happens with missing values. If you know that your data will
never be missing, you need not worry.
If y[1] were missing, record[1] would be too, and the
replace statement would never change our mind about the record,
as nothing can be greater than missing, and missing values would get
propagated down the dataset in a cascade.
Similarly, if any y in observations 2 onwards were missing, the
record would be set to missing and that value propagated downwards. The
second problem is easier to fix:
. replace record = cond(y > record[_n - 1] & y < ., y, record[_n - 1])
Hence, we should change y only when it is nonmissing. The first
problem requires extra surgery, left as a small exercise for you to
consider. (If you know about extended missing values .a through .z, you know
that they complicate the story a little without changing essentials.)
However, at this point, I am going to switch to another solution, which
indeed may have occurred to you much earlier. It is just tempting to look
at a cond() solution, because cond() is a good general
do-it-yourself function in which you spell out what you want given two
possible outcomes.
The maximum of two values (the larger) in Stata is given directly by
max(). This has a
property which at first sight may seem surprising, but needs to be grasped.
Convince yourself of it by
. display max(1,.)
1
. display max(.,.)
.
In brief, max() ignores missing values. Only when all arguments
supplied are missing does it show its own limited version of despair and
return a missing value. If this seems surprising, it is because you have
learned—sometimes painfully—that in Stata numeric missing is
reckoned to be arbitrarily large, or at least bigger than any other number
that Stata can hold. max() is an exception to that, or perhaps it is
better to say that max() ignores missing values unless it has no
alternative. Either way, max() solves, or dissolves, all our worries
about missing values.
. generate record = y[1] in 1
. replace record = max(y, record[_n - 1]) in 2/l
or
. generate record = y[1] in 1
. replace record = max(y, record[_n - 1]) if missing(record)
That is it. Running through the worries again:
If record[1] is missing because y[1] is missing, the first
nonmissing value encountered is still used as the first nonmissing record
because max(missing, nonmissing) is
always nonmissing.
If any record from observation 2 on is missing, this does not matter for
the same reason. So a missing value in the sequence will not contaminate the
values of record perversely.
Two complications are easy to deal with. Suppose that you have panel
(longitudinal) data. How do you deal with that? The answer is just to
use by:, as in something like
. by id(year), sort: generate record = y[1] if _n == 1
. by id: replace record = max(y, record[_n - 1]) if missing(record)
One key feature of
by: is that subscripts
like [1] and [_n − 1] are interpreted within the groups
defined by by:. If you want more on that, that is another story.
See the sections of the manual on by: or Cox (2002).
Finally, if you want minimums and not maximums, use
min() and not
max().
References
- Cox, N. J. 2002.
- Speaking Stata: How to move step by: step.
Stata
Journal 2:86–102.
- Kantor, D., and N. J. Cox. 2005.
- Depending on conditions: a tutorial on the cond() function.
Stata
Journal 5:413–420.
- Newson, R. 2004.
- Stata tip 13: generate and replace use the current sort order.
Stata Journal 4:484–485.
(A tip on Stata Tips: nonsubscribers to the Stata Journal can purchase a compact
collation of the first 33 tips: see
http://www.stata.com/bookstore/tips.html.)
|