Thanks to Kit Baum, the -egenmore- package on SSC has been updated. The
addition is a single -egen- function, -rowmedian()-, which calculates
row medians (surprise!). The program uses Mata, so Stata 9 is required.
Thanks to Bill Gould also for several very helpful comments on drafts of
the code.
Use -ssc- to install or update -egenmore-. Other functions in the package
mostly work with earlier versions of Stata too.
The question of row medians arises from time to time, most recently on
Statalist in a thread started by Christer Thrane. What follows is a
zeroth draft of a FAQ to be offered to StataCorp (they might decline!),
so corrections and complementary [sic] remarks are welcome.
Calculating medians in statistical software such as Stata should be easy
-- and it usually is. When you want a row median -- a median within
observations across several variables of the same kind -- it appears
trickier.
This problem is worth a little discussion, not just because it arises in
practice. It also provides a little tutorial example of various parts of
Stata in play.
To recapitulate basics: If you want a median of a variable, you could
-sort- on that variable and pick out the median yourself. Some care is
needed whenever you are only interested in some of the observations, or
there are missing values. Mostly, users fire up -summarize, detail-,
which takes care of all such matters, and identify the median within the
results. It is left in memory immediately afterwards as -r(p50)-.
Alternatively, if you want to put the median into a variable, you can
use -egen, median()-. This makes most sense when your data are
subdivided into groups, and you want each group median to be recorded
for later use. Hence you would use -egen, median()- together with -by:-.
When you have a bunch of variables of the same kind, sometimes there is
an easy answer. First off, if those variables are really panel or
longitudinal data, you should -reshape long- and work with a different
data structure. Life will be much easier that way, not just for row
medians, which are now just panel medians, but for almost all kind of
analysis you might want to do.
Next, if exceptionally you know that the values of your row variables
are in numerical order, so that for example -y1- is always less than or
equal to -y2-, which in turn is less than or equal to -y3-, and so
forth, then the median can be calculated directly as either one of the
variables or the mean of two of the variables, depending on whether the
number of variables is odd or even. But that would be messed up by any
missing values.
You can get to that situation of row variables in numerical order by
using -rowsort- from SSC (requires Stata 8; works only with integer
values) or -sortrows- from SSC (requires Stata 9).
Next up are situations in which you have a small number of variables
over which you want to take the row median -- and again no observations
are missing. You will know that the median of two variables is the same
as their mean, so that first case is easy:
. gen median = (y1 + y2) / 2
A less well-known trick for three variables also makes light of the
problem:
. gen median = y1 + y2 + y3 - min(y1, y2, y3) - max(y1, y2, y3)
In words: Work out the row sum; then subtract the minimum and also the
maximum. What remains must be the median.
Now a trick for four variables is immediate:
. gen median = (y1 + y2 + y3 + y4 - ///
min(y1, y2, y3, y4) - max(y1, y2, y3, y4)) / 2
In words: Work out the row sum; then subtract the minimum and also the
maximum. What remains is the sum of the two inner values and halving
gives the median.
Simple tricks for five or more variables, in general, are not in
evidence.
Some thought shows that ties pose no problem to any of these tricks. The
more awkward assumption is that no values are missing. There is some
scope for salvaging problems with missing values. Know that you can
calculate the number missing from
. egen nmissing = rowmiss(<varlist>)
but for a few variables it is as easy to count missings on the fly.
With any missings, the median of two can be salvaged by
. gen median = (y1 + y2) / 2
. replace median = max(y1, y2) if median == .
This exploits the fact that -max(y1, y2)- is non-missing whenever one of
the values is. If both values are missing, we do not lose out really, as
some missings are just over-written by missings. You could write
-min(y1, y2)- if that made you feel more comfortable. (Why is the result
exactly the same?)
Similarly, the median of three can be salvaged.
. gen median = y1 + y2 + y3 - min(y1, y2, y3) - max(y1, y2, y3)
If -y1-, -y2-, -y3- are all missing, then this already gives the only
possible answer, missing. The situations to fix are that just one
variable is missing and that two variables are missing in each
observation.
. replace median = max(y1, y2, y3) if
(missing(y1) + missing(y2) + missing(y3)) == 2
If two variables are missing, then we can get the other one from -max()-
and it is automatically the median.
. replace median = ( ///
cond(missing(y1), 0, y1) + ///
cond(missing(y2), 0, y2) + ///
cond(missing(y3), 0, y3) ) / 2 ///
if (missing(y1) + missing(y2) + missing(y3)) == 1
If one variable is missing, then the mean of the other two gives the
median. Note how we make sure that missings are ignored in the sum by
using 0 instead. If we knew that all variables should be strictly
positive, then you could use terms like -max(y1, 0)- rather than terms
like -cond(missing(y1), 0, y1)-, and if you knew that all variables
should be strictly negative you could use -min(y1, 0)- instead.
Further, you could do that with -egen-:
. egen rowsum = rowsum(y1 y2 y3)
. replace median = rowsum / 2 ///
if (missing(y1) + missing(y2) + missing(y3)) == 1
Manifestly, although the code has some curiosity value in showing that
you don't always need to sort to get row medians, we really do need a
program embodying a more systematic approach.
However, there is no official -egen- function for row medians and you
may wonder why not. If you seek out user-written functions and look at
them carefully, you can start to see why. In essence, two approaches were
possible before Stata 9.
The first, implemented in -egen, rmed()- from STB-57 (Stata 5 required),
is to loop over observations, copy values into a variable and then get
the median. Unfortunately, this assumes that the number of variables
concerned is no greater than the number of observations. That is usually
but not necessarily true. More importantly, this approach can be very
slow indeed.
The second, implemented in -egen, rmedf()- from SSC (Stata 6 required),
is to restructure the dataset on the fly, calculate medians, and then
restructure back. Arguably, restructuring a dataset is not something
that should be done in the middle of an -egen- function, but in any case
this could easily fail if enough memory were not available.
With Stata 9, however, comes a more positive opportunity: to use Mata.
The package -egenmore- on SSC included -egen, rowmedian()-. It is much,
much faster than previous -egen- functions -- even though the basic loop
is still a loop over observations -- and it requires very little extra
memory. Here is the code for those interested:
program _growmedian
version 9
gettoken type 0 : 0
gettoken h 0 : 0
gettoken eqs 0 : 0
syntax varlist(numeric) [if] [in] [, BY(string)]
if `"`by'"' != "" {
_egennoby rowmedian() `"`by'"'
/* NOTREACHED */
}
marksample touse, novarlist
quietly {
mata : row_median("`varlist'", "`touse'", "`h'", "`type'")
}
end
mata :
void row_median(string scalar varnames,
string scalar tousename,
string scalar medianname,
string scalar type)
{
real matrix y
real colvector median, row
real scalar n
st_view(y, ., tokens(varnames), tousename)
median = J(rows(y), 1, .)
for(i = 1; i <= rows(y); i++) {
row = y[i,]'
if (n = colnonmissing(row)) { // sic
_sort(row, 1)
median[i] =
(row[ceil(n / 2)] + row[floor(n + 2) / 2]) / 2
}
}
st_addvar(type, medianname)
st_store(., medianname, tousename, median)
}
end
Nick
n.j.cox@durham.ac.uk
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/