Search
   >> Home >> Resources & support >> FAQs >> Calculating row medians
Note: This FAQ is for Stata 10 and older versions. In Stata 11, the function rowmedian allows you to compute row medians directly with egen.

How do I calculate row medians?

Title   Calculating row medians
Author Nicholas J. Cox, Durham University
Date March 2007; minor revisions June 2007

Calculating medians in statistical software such as Stata should be easy, and it usually is. When you want a row median—a median within observations across several variables of the same kind—it appears trickier.

This problem is worth some discussion, not just because it arises in practice. It also provides a little example of various parts of Stata in play.

To recapitulate basics: if you want a median of a variable, you could sort on that variable and pick out the median yourself. Some care is needed whenever you are interested in only some of the observations or there are missing values. Mostly, users fire up summarize, detail, which takes care of all such matters, and identify the median within the results. It is then left in memory immediately as r(p50).

If you instead want to put the median into a variable, you can use egen, median(). This makes most sense when your data are subdivided into groups and you want each group median to be recorded for later use. Hence you would use egen, median() together with by:.

When you have a bunch of variables of the same kind, sometimes there is an easy answer. If those variables are really panel or longitudinal data, you should reshape long and work with a different data structure. Life will be much easier that way, not only for row medians, which are now just panel medians, but also for almost all kinds of analysis you that might want to do with such data.

If you know that the values of your row variables are in numerical order, so that for example y1 is always less than or equal to y2, which in turn is less than or equal to y3, and so forth, then the median can be calculated directly as either one of the variables or the mean of two of the variables, depending on whether the number of variables is odd or even. But that would be messed up by any missing values.

You can get to that situation of row variables in numerical order by using rowsort from SSC (requires Stata 8; works only with integer values) or sortrows from SSC (requires Stata 9).

Next come situations when you have few variables (two, three, or four) over which you want to take the row median—and again no values are missing. The median of two variables is the same as their mean, so that first case is easy:

. gen median = (y1 + y2) / 2 

A lesser-known trick for three variables also makes solving the problem simple:

. gen median = y1 + y2 + y3 - min(y1, y2, y3) - max(y1, y2, y3) 

In words: work out the row sum; then subtract the minimum and the maximum. What remains must be the median.

Now a trick for four variables is immediate:

. gen median = (y1 + y2 + y3 + y4 - ///
                min(y1, y2, y3, y4) - max(y1, y2, y3, y4)) / 2 

In words: work out the row sum; then subtract the minimum and the maximum. What remains is the sum of the two inner values, and halving gives the median.

Simple tricks for five or more variables, in general, are not in evidence.

Some thought shows that ties pose no problem to any of these tricks. The more awkward assumption is that no values are missing. There is some scope for salvaging problems with missing values. You can calculate the number missing from

. egen nmissing = rowmiss(varlist)

but for a few variables, counting missing variables on the fly is just as easy.

With any missing variables, the median of two can be salvaged by

. gen median = (y1 + y2) / 2 
. replace median = max(y1, y2) if median == . 

This command sequence exploits the fact that max(y1, y2) is nonmissing whenever one of the values is. If both values are missing, we do not lose out really, since some missings are just overwritten by missings. You could write min(y1, y2) if that made you feel more comfortable. (Why is the result the same?)

Similarly, the median of three can be salvaged.

. gen median = y1 + y2 + y3 - min(y1, y2, y3) - max(y1, y2, y3) 

If y1, y2, and y3 are all missing, you already have the only possible answer: missing. The situations to fix are that just one variable is missing and that two variables are missing in each observation.

. replace median = max(y1, y2, y3) if 
	(missing(y1) + missing(y2) + missing(y3)) == 2 

If two variables are missing, then we can get the other one from max() and it is automatically the median.

. replace median = ( ///
        cond(missing(y1), 0, y1) + /// 
        cond(missing(y2), 0, y2) + ///
        cond(missing(y3), 0, y3) ) / 2 ///
        if (missing(y1) + missing(y2) + missing(y3)) == 1 

If one variable is missing, then the mean of the other two gives the median. We make sure that missings are ignored in the sum by using 0 instead. If we knew that all variables should be positive (or zero), then you could use terms like max(y1, 0) rather than terms like cond(missing(y1), 0, y1), and if you knew that all variables should be negative (or zero), you could use min(y1, 0) instead. Further, you could do the same thing with egen:

. egen rowsum = rowsum(y1 y2 y3)
. replace median = rowsum / 2 ///
	if (missing(y1) + missing(y2) + missing(y3)) == 1 

Manifestly, although the code so far has some curiosity value in showing that you don’t always need to sort to get row medians, we really do need a program embodying a more systematic approach.

However, there is no official egen function for row medians. If you seek out user-written functions and look at them carefully, you can start to see why. In essence, two approaches were possible before Stata 9.

The first, implemented in egen, rmed() from STB-57 (Stata 5 required), is to loop over observations, copy values into a variable and then get the median. Unfortunately, this approach assumes that the number of variables concerned is no greater than the number of observations. That is usually but not necessarily true. More importantly, this approach can be slow indeed.

The second, implemented in egen, rmedf() from SSC (Stata 6 required), is to restructure the dataset on the fly, calculate medians, and then restructure back. Arguably, restructuring a dataset is not something that should be done in the middle of an egen function, but in any case this approach could easily fail if enough memory were not available.

With Stata 9, however, comes a more positive opportunity: to use Mata. The package egenmore on SSC includes egen, rowmedian(). It is much faster than previous egen functions—even though the basic loop is still a loop over observations—and it requires little extra memory. Here is the code:

 program _growmedian
          version 9
          gettoken type 0 : 0
          gettoken h    0 : 0 
          gettoken eqs  0 : 0
          
          syntax varlist(numeric) [if] [in] [, BY(string)]
          if `"`by'"' != "" {
		  _egennoby rowmedian() `"`by'"'
		  /* NOTREACHED */
          }
        
          marksample touse, novarlist 
          quietly { 
                  mata : row_median("`varlist'", "`touse'", "`h'", "`type'") 
          }
 end

 mata : 

 void row_median(string scalar varnames, 
                 string scalar tousename,
                 string scalar medianname,
                 string scalar type)
 { 
         real matrix y 
         real colvector median, row
         real scalar n

         st_view(y, ., tokens(varnames), tousename)    
         median = J(rows(y), 1, .) 

         for(i = 1; i <= rows(y); i++) { 
                 row = y[i,]'        
                 if (n = colnonmissing(row)) { // sic 
                         _sort(row, 1)
                         median[i] = 
                                 (row[ceil(n / 2)] + row[floor(n + 2) / 2]) / 2
                 }
          }

          st_addvar(type, medianname)
          st_store(., medianname, tousename, median) 
 }	

 end
The Stata Blog: Not Elsewhere Classified Find us on Facebook Follow us on Twitter LinkedIn Google+ Watch us on YouTube