Title | Calculating row medians | |

Author | Nicholas J. Cox, Durham University |

Calculating medians in statistical software such as Stata should be easy, and it usually is. When you want a row median—a median within observations across several variables of the same kind—it appears trickier.

This problem is worth some discussion, not just because it arises in practice. It also provides a little example of various parts of Stata in play.

To recapitulate basics: if you want a median of a variable, you could
**sort** on that
variable and pick out the median yourself. Some care is needed whenever you
are interested in only some of the observations or there are missing
values. Mostly, users fire up
**summarize,
detail**, which takes care of all such matters, and identify the
median within the results. It is then left in memory immediately as
**r(p50)**.

If you instead want to put the median into a variable, you can use
**egen, median()**. This
makes most sense when your data are subdivided into groups and you want
each group median to be recorded for later use. Hence you would use **egen,
median()** together with **
by:**.

When you have a bunch of variables of the same kind, sometimes there is an
easy answer. If those variables are really panel or longitudinal data, you
should **reshape long**
and work with a different data structure. Life will be much easier that way,
not only for row medians, which are now just panel medians, but also for almost
all kinds of analysis you that might want to do with such data.

If you know that the values of your row variables are in
numerical order, so that for example **y1** is always less than or equal
to **y2**, which in turn is less than or equal to **y3**, and so
forth, then the median can be calculated directly as either one of the
variables or the mean of two of the variables, depending on whether the
number of variables is odd or even. But that would be messed up by any
missing values.

You can get to that situation of row variables in numerical order by using
**rowsort** from SSC (requires Stata 8; works only with integer values)
or **sortrows** from SSC (requires Stata 9).

Next come situations when you have few variables (two, three, or four) over which you want to take the row median—and again no values are missing. The median of two variables is the same as their mean, so that first case is easy:

. gen median = (y1 + y2) / 2

A lesser-known trick for three variables also makes solving the problem simple:

. gen median = y1 + y2 + y3 - min(y1, y2, y3) - max(y1, y2, y3)

In words: work out the row sum; then subtract the minimum and the maximum. What remains must be the median.

Now a trick for four variables is immediate:

. gen median = (y1 + y2 + y3 + y4 - /// min(y1, y2, y3, y4) - max(y1, y2, y3, y4)) / 2

In words: work out the row sum; then subtract the minimum and the maximum. What remains is the sum of the two inner values, and halving gives the median.

Simple tricks for five or more variables, in general, are not in evidence.

Some thought shows that ties pose no problem to any of these tricks. The more awkward assumption is that no values are missing. There is some scope for salvaging problems with missing values. You can calculate the number missing from

. egen nmissing = rowmiss(varlist)

but for a few variables, counting missing variables on the fly is just as easy.

With any missing variables, the median of two can be salvaged by

. gen median = (y1 + y2) / 2 . replace median = max(y1, y2) if median == .

This command sequence exploits the fact that **max(y1, y2)** is
nonmissing whenever one of the values is. If both values are missing, we do
not lose out really, since some missings are just overwritten by missings.
You could write **min(y1, y2)** if that made you feel more comfortable.
(Why is the result the same?)

Similarly, the median of three can be salvaged.

. gen median = y1 + y2 + y3 - min(y1, y2, y3) - max(y1, y2, y3)

If **y1**, **y2**, and **y3** are all missing, you already have the
only possible answer: missing. The situations to fix are that just one
variable is missing and that two variables are missing in each observation.

. replace median = max(y1, y2, y3) if (missing(y1) + missing(y2) + missing(y3)) == 2

If two variables are missing, then we can get the other one from
**max()** and
it is automatically the median.

. replace median = ( /// cond(missing(y1), 0, y1) + /// cond(missing(y2), 0, y2) + /// cond(missing(y3), 0, y3) ) / 2 /// if (missing(y1) + missing(y2) + missing(y3)) == 1

If one variable is missing, then the mean of the other two gives the median.
We make sure that missings are ignored in the sum by using 0
instead. If we knew that all variables should be positive (or zero), then
you could use terms like **max(y1, 0)** rather than terms like
**cond(missing(y1), 0, y1)**, and if you knew that all variables should
be negative (or zero), you could use **min(y1, 0)** instead. Further, you
could do the same thing with **egen**:

. egen rowsum = rowsum(y1 y2 y3) . replace median = rowsum / 2 /// if (missing(y1) + missing(y2) + missing(y3)) == 1

Manifestly, although the code so far has some curiosity value in showing
that you don’t always need to **sort** to get row medians, we
really do need a program embodying a more systematic approach.

However, there is no official **egen** function for row medians.
If you seek out user-written functions and look at them carefully, you can
start to see why. In essence, two approaches were possible before Stata 9.

The first, implemented in **egen, rmed()** from STB-57 (Stata 5
required), is to loop over observations, copy values into a variable and
then get the median. Unfortunately, this approach assumes that the number of
variables concerned is no greater than the number of observations. That is
usually but not necessarily true. More importantly, this approach can be
slow indeed.

The second, implemented in **egen, rmedf()** from SSC (Stata 6 required),
is to restructure the dataset on the fly, calculate medians, and then
restructure back. Arguably, restructuring a dataset is not something that
should be done in the middle of an **egen** function, but in any case
this approach could easily fail if enough memory were not available.

With Stata 9, however, comes a more positive opportunity: to use Mata. The
package **egenmore** on SSC includes **egen, rowmedian()**. It is much
faster than previous **egen** functions—even though the basic loop
is still a loop over observations—and it requires little extra memory.
Here is the code:

program _growmedian version 9 gettoken type 0 : 0 gettoken h 0 : 0 gettoken eqs 0 : 0 syntax varlist(numeric) [if] [in] [, BY(string)] if `"`by'"' != "" { _egennoby rowmedian() `"`by'"' /* NOTREACHED */ } marksample touse, novarlist quietly { mata : row_median("`varlist'", "`touse'", "`h'", "`type'") } end mata : void row_median(string scalar varnames, string scalar tousename, string scalar medianname, string scalar type) { real matrix y real colvector median, row real scalar n st_view(y, ., tokens(varnames), tousename) median = J(rows(y), 1, .) for(i = 1; i <= rows(y); i++) { row = y[i,]' if (n = colnonmissing(row)) { // sic _sort(row, 1) median[i] = (row[ceil(n / 2)] + row[floor(n + 2) / 2]) / 2 } } st_addvar(type, medianname) st_store(., medianname, tousename, median) } end