This FAQ is based on material posted on
Statalist.

Title | Calculating the maximum and minimum of a sequence | |

Author | Nicholas J. Cox, Durham University |

The maximum value of a variable seen so far in a sequence is the “record” to date, that is, at least when high values are hard to achieve (think of field events such as jumping or throwing). It will become apparent that the best solution to getting this is easily adapted to get the case of the minimum seen so far, the record when low values are hard to achieve (think of track events such as running).

For example, consider some numeric response **y** measured in a series of
years, **year**, and get the data in year order:

. sort year

If you look at a series of values, how would you determine the series of records without Stata? One simple algorithm, in a mixture of Stata and pseudocode, runs like this:

initialize: record[1] = y[1]

The first value observed is the highest so far.

loop: forvalues this = 2(1)_N { if y[this] > record[this − 1] { record[this] = y[this] } else record[this] = record[this − 1] }

The loop runs over the integers from 2 to the number of observations
**_N**. So we start the loop with **this** set to 2, and the
operation becomes

if y[2] > record[1] { record[2] = y[2] } else record[2] = record[1]

and next time around the loop we set **this** to 3, and we get

if y[3] > record[2] { record[3] = y[3] } else record[3] = record[2]

and so on. With inequalities, here >, it is always a good idea to check whether the fraternal twin, here >=, is needed instead, but a moment’s thought shows that if the new value is the same as the record to date, we do not need to change the record. (If your problem had some extra details, such as an interest in precisely when the record was broken or equalled, you would need to take extra care.)

Now setting this up as a looping problem is, in at least one sense, a red
herring, as you do not need to reach for any of Stata’s looping machinery,
**forvalues**,
**foreach**, or
**while**. In
fact, you definitely should not do that. We can exploit the fact that
**generate** and
**replace** use
Stata’s sort order, made explicit in Newson (2004)
and in the
“FAQ: How can
I replace missing values with previous or following nonmissing values?”.

The first thing we should do is initialize. We could do that by setting
every value of **record** to **y[1]**

. generate record = y[1]

or just by setting the first value of **record** to **y[1]**

. generate record = y[1] in 1

Without any instructions on what to do about **record[2]** onwards, Stata
can only set those values to numeric missing (.).

In other problems, it will matter how you start; we make sure that it does not matter here by overwriting every value from the second onwards.

You probably already know about a general way to refer to the previous
observation in Stata, using the subscript **_n − 1**. **_n**
stands in general for the current observation number, and so **_n −
1** for the one before. We will want to
**replace** the
values of **record** in observations 2 onward, so we can pencil that in.
One way is

. replace record =...in 2/l

where **2/l** signals observations from **2** until the last
(**l** for last). If we had indeed previously written

. generate record = y[1] in 1

We could pencil in

. replace record =...if record == .

or

. replace record =...if missing(record)

A key point here is that the **replace** statement will implement the
loop in the algorithm previously given, as **replace** follows the
current sort order. We just need to fill in the dots (**...**) in the
command above. The dots will be a Stata translation of

if y[this] > record[this - 1] { record[this] = y[this] } else record[this] = record[this - 1]

One way to do this is through the
**cond()** function.
**cond()** lets you do different things, depending on whether a
specified condition is true or false, which is what we have here.
Kantor and Cox (2005) give a tutorial with several examples.

Our true or false condition is

y[this] > record[this - 1]

and, as previously mentioned, **this** is **_n** in Stata terms when
referring to observations. In Stata, this condition becomes

y[_n] > record[_n - 1]

But **y[_n]** can just be written **y**, as Stata by default always
works on the current observation number (or, more precisely, all of them in
sequence).

**cond()** works like this:

cond(true-or-false-condition,result if true,result if false)

With one bound we get a solution:

. generate record = y[1] in 1 . replace record = cond(y > record[_n - 1], y, record[_n - 1]) in 2/l

This looks like progress, but we are not quite there. We should worry about what happens with missing values. If you know that your data will never be missing, you need not worry.

If **y[1]** were missing, **record[1]** would be too, and the
**replace** statement would never change our mind about the **record**,
as nothing can be greater than missing, and missing values would get
propagated down the dataset in a cascade.

Similarly, if any **y** in observations 2 onwards were missing, the
record would be set to missing and that value propagated downwards. The
second problem is easier to fix:

. replace record = cond(y > record[_n - 1] & y < ., y, record[_n - 1])

Hence, we should change **y** only when it is nonmissing. The first
problem requires extra surgery, left as a small exercise for you to
consider. (If you know about extended missing values .a through .z, you know
that they complicate the story a little without changing essentials.)

However, at this point, I am going to switch to another solution, which
indeed may have occurred to you much earlier. It is just tempting to look
at a **cond()** solution, because **cond()** is a good general
do-it-yourself function in which you spell out what you want given two
possible outcomes.

The maximum of two values (the larger) in Stata is given directly by
**max()**. This has a
property which at first sight may seem surprising, but needs to be grasped.
Convince yourself of it by

. display max(1,.)1. display max(.,.).

In brief, **max()** ignores missing values. Only when all arguments
supplied are missing does it show its own limited version of despair and
return a missing value. If this seems surprising, it is because you have
learned—sometimes painfully—that in Stata numeric missing is
reckoned to be arbitrarily large, or at least bigger than any other number
that Stata can hold. **max()** is an exception to that, or perhaps it is
better to say that **max()** ignores missing values unless it has no
alternative. Either way, **max()** solves, or dissolves, all our worries
about missing values.

. generate record = y[1] in 1 . replace record = max(y, record[_n - 1]) in 2/l

or

. generate record = y[1] in 1 . replace record = max(y, record[_n - 1]) if missing(record)

That is it. Running through the worries again:

If **record[1]** is missing because **y[1]** is missing, the first
nonmissing value encountered is still used as the first nonmissing record
because **max(***missing***,** *nonmissing***)** is
always *nonmissing*.

If any **record** from observation 2 on is missing, this does not matter for
the same reason. So a missing value in the sequence will not contaminate the
values of **record** perversely.

Two complications are easy to deal with. Suppose that you have panel
(longitudinal) data. How do you deal with that? The answer is just to
use **by:**, as in something like

. by id(year), sort: generate record = y[1] if _n == 1 . by id: replace record = max(y, record[_n - 1]) if missing(record)

One key feature of
**by:** is that subscripts
like **[1]** and **[_n − 1]** are interpreted within the groups
defined by **by:**. If you want more on that, that is another story.
See the sections of the manual on **by:** or Cox (2002).

Finally, if you want minimums and not maximums, use
**min()** and not
**max()**.

All that said, since 2016 and for Stata 11+, the maximum so far or the
minimum so far can be calculated directly with the user-written program
**rangestat**, which should be installed from SSC using **ssc inst rangestat**.
An example you can try yourself

.webuse grunfeld, clear.rangestat (max) mvalue, int(year . 0) by(company)

gives the flavor. Here the time interval is given by the option **int(year .
0)**. **year** is the variable concerned and **.** (system missing) and **0** are
negative and positive offsets. System missing as negative offset is here
interpreted as including the earliest date in the data. 0 is just zero
as usual, so the interval is a specification of "so far". The option
**by(company)** ensures that calculations are carried out separately for each
panel. The help for **rangestat** gives detailed discussion with several
worked examples.

- Cox, N. J. 2002.
- Speaking Stata: How to move step by: step.
*Stata Journal*2:86–102.

- Kantor, D., and N. J. Cox. 2005.
- Depending on conditions: a tutorial on the cond() function.
*Stata Journal*5:413–420.

- Newson, R. 2004.
- Stata tip 13: generate and replace use the current sort order.
*Stata Journal*4:484–485.

(A tip on Stata Tips: nonsubscribers to the *Stata Journal* can purchase a compact
collation of the first 119 tips: see
www.stata.com/bookstore/stata-tips.)