How do I calculate row medians?
|
Title
|
|
Calculating row medians
|
|
Author
|
Nicholas J. Cox, Durham University
|
|
Date
|
March 2007; minor revisions June 2007
|
Calculating medians in statistical software such as Stata should be
easy, and it usually is. When you want a row median—a median
within observations across several variables of the same kind—it
appears trickier.
This problem is worth some discussion, not just because it arises in
practice. It also provides a little example of various parts of
Stata in play.
To recapitulate basics: if you want a median of a variable, you could
sort on that
variable and pick out the median yourself. Some care is needed whenever you
are interested in only some of the observations or there are missing
values. Mostly, users fire up
summarize,
detail, which takes care of all such matters, and identify the
median within the results. It is then left in memory immediately as
r(p50).
If you instead want to put the median into a variable, you can use
egen, median(). This
makes most sense when your data are subdivided into groups and you want
each group median to be recorded for later use. Hence you would use egen,
median() together with
by:.
When you have a bunch of variables of the same kind, sometimes there is an
easy answer. If those variables are really panel or longitudinal data, you
should reshape long
and work with a different data structure. Life will be much easier that way,
not only for row medians, which are now just panel medians, but also for almost
all kinds of analysis you that might want to do with such data.
If you know that the values of your row variables are in
numerical order, so that for example y1 is always less than or equal
to y2, which in turn is less than or equal to y3, and so
forth, then the median can be calculated directly as either one of the
variables or the mean of two of the variables, depending on whether the
number of variables is odd or even. But that would be messed up by any
missing values.
You can get to that situation of row variables in numerical order by using
rowsort from SSC (requires Stata 8; works only with integer values)
or sortrows from SSC (requires Stata 9).
Next come situations when you have few variables (two, three, or
four) over which you want to take the row median—and again no
values are missing. The median of two variables is
the same as their mean, so that first case is easy:
. gen median = (y1 + y2) / 2
A lesser-known trick for three variables also makes solving the problem simple:
. gen median = y1 + y2 + y3 - min(y1, y2, y3) - max(y1, y2, y3)
In words: work out the row sum; then subtract the minimum and the
maximum. What remains must be the median.
Now a trick for four variables is immediate:
. gen median = (y1 + y2 + y3 + y4 - ///
min(y1, y2, y3, y4) - max(y1, y2, y3, y4)) / 2
In words: work out the row sum; then subtract the minimum and the
maximum. What remains is the sum of the two inner values, and halving gives
the median.
Simple tricks for five or more variables, in general, are not in evidence.
Some thought shows that ties pose no problem to any of these tricks. The
more awkward assumption is that no values are missing. There is some scope
for salvaging problems with missing values. You can calculate the
number missing from
. egen nmissing = rowmiss(varlist)
but for a few variables, counting missing variables on the fly is just as easy.
With any missing variables, the median of two can be salvaged by
. gen median = (y1 + y2) / 2
. replace median = max(y1, y2) if median == .
This command sequence exploits the fact that max(y1, y2) is
nonmissing whenever one of the values is. If both values are missing, we do
not lose out really, since some missings are just overwritten by missings.
You could write min(y1, y2) if that made you feel more comfortable.
(Why is the result the same?)
Similarly, the median of three can be salvaged.
. gen median = y1 + y2 + y3 - min(y1, y2, y3) - max(y1, y2, y3)
If y1, y2, and y3 are all missing, you already have the
only possible answer: missing. The situations to fix are that just one
variable is missing and that two variables are missing in each observation.
. replace median = max(y1, y2, y3) if
(missing(y1) + missing(y2) + missing(y3)) == 2
If two variables are missing, then we can get the other one from
max() and
it is automatically the median.
. replace median = ( ///
cond(missing(y1), 0, y1) + ///
cond(missing(y2), 0, y2) + ///
cond(missing(y3), 0, y3) ) / 2 ///
if (missing(y1) + missing(y2) + missing(y3)) == 1
If one variable is missing, then the mean of the other two gives the median.
We make sure that missings are ignored in the sum by using 0
instead. If we knew that all variables should be positive (or zero), then
you could use terms like max(y1, 0) rather than terms like
cond(missing(y1), 0, y1), and if you knew that all variables should
be negative (or zero), you could use min(y1, 0) instead. Further, you
could do the same thing with egen:
. egen rowsum = rowsum(y1 y2 y3)
. replace median = rowsum / 2 ///
if (missing(y1) + missing(y2) + missing(y3)) == 1
Manifestly, although the code so far has some curiosity value in showing
that you don’t always need to sort to get row medians, we
really do need a program embodying a more systematic approach.
However, there is no official egen function for row medians.
If you seek out user-written functions and look at them carefully, you can
start to see why. In essence, two approaches were possible before Stata 9.
The first, implemented in egen, rmed() from STB-57 (Stata 5
required), is to loop over observations, copy values into a variable and
then get the median. Unfortunately, this approach assumes that the number of
variables concerned is no greater than the number of observations. That is
usually but not necessarily true. More importantly, this approach can be
slow indeed.
The second, implemented in egen, rmedf() from SSC (Stata 6 required),
is to restructure the dataset on the fly, calculate medians, and then
restructure back. Arguably, restructuring a dataset is not something that
should be done in the middle of an egen function, but in any case
this approach could easily fail if enough memory were not available.
With Stata 9, however, comes a more positive opportunity: to use Mata. The
package egenmore on SSC includes egen, rowmedian(). It is much
faster than previous egen functions—even though the basic loop
is still a loop over observations—and it requires little extra memory.
Here is the code:
program _growmedian
version 9
gettoken type 0 : 0
gettoken h 0 : 0
gettoken eqs 0 : 0
syntax varlist(numeric) [if] [in] [, BY(string)]
if `"`by'"' != "" {
_egennoby rowmedian() `"`by'"'
/* NOTREACHED */
}
marksample touse, novarlist
quietly {
mata : row_median("`varlist'", "`touse'", "`h'", "`type'")
}
end
mata :
void row_median(string scalar varnames,
string scalar tousename,
string scalar medianname,
string scalar type)
{
real matrix y
real colvector median, row
real scalar n
st_view(y, ., tokens(varnames), tousename)
median = J(rows(y), 1, .)
for(i = 1; i <= rows(y); i++) {
row = y[i,]'
if (n = colnonmissing(row)) { // sic
_sort(row, 1)
median[i] =
(row[ceil(n / 2)] + row[floor(n + 2) / 2]) / 2
}
}
st_addvar(type, medianname)
st_store(., medianname, tousename, median)
}
end
|