Title | Calculating percentile ranks or plotting positions | |

Author | Nicholas J. Cox, Durham University, UK |

Some kinds of data are often reported as percentile ranks. A test score may be reported as a percentile rank of 95% if 95% of scores are less than or equal to that score. A newborn baby’s weight may be reported in the same way.

The idea of a plotting position is essentially similar, except that conventionally plotting positions are reported as proportions rather than percentages. Their name stems from their use in probability plots. Given a set of ordered values, proportions are assigned recording the fraction of values at or below each value. These proportions may then be used directly or indirectly in plots. For example, a quantile–quantile plot for testing normality of distribution compares observed quantiles with quantiles for a sample of the same size from a normal distribution, determined by evaluating the quantile (inverse distribution) function for the normal at the plotting positions.

Official Stata has no functions for these calculations, although user-defined functions may be found. They may, however, be performed easily using results from other functions. The lack of built-in functions has one advantage: it obliges users to select the precise procedure to be used, given that several slightly different rules exist.

The calculation of the empirical cumulative distribution, at least as
implemented in Stata, is a related but subtly different problem. In official
Stata, this calculation may be performed using
**cumul**.
**cumul** by default appears to be most closely geared to preparing a
plot of cumulative probabilities (*y* axis) versus observed values
(*x* axis). When tied values occur, **cumul** will not (again, by
default) assign the same cumulative probability to each tied value. This is
immaterial and indeed even desirable for graphical purposes: vertical line
segments will be produced whatever the convention about ties, and when
visible plot symbols are used, it is often helpful to get an impression of
the frequency of ties. In contrast, this FAQ is focused mainly on problems
in which identical values should all be assigned the same proportion or
percentage. Incidentally, an **equal** option was added to **cumul**
on 17 June 2003 to override the default and ensure that this condition is
satisfied.

To assign ranks, we might
**sort** on a variable
(say, **mpg** from the auto dataset) and then use the observation number:

. sysuse auto . sort mpg . gen rank = _n

As sorting in Stata puts lowest values first, the lowest value—or more
precisely, the value sorted to the top of the dataset—would be
assigned rank 1. This is also the most common statistical convention.
However, this code is too simple to work perfectly except in the simplest
situation with no tied values and no missing values. For example, looking at
data for **mpg**, we see that the first two values after sorting are tied
at 12, yet these are assigned ranks 1 and 2 by the
**generate** command
above. In statistics it is common to assign the same rank to each of
several tied values such that the sum of the ranks is preserved. Ranks 1 and
2 give a sum of 3 to be preserved, so each would become rank 1.5 instead. We
could carry out this reassignment for all tied values ourselves, but it is
convenient that it is encapsulated in
**egen**’s
**rank()** function:

. egen rank = rank(mpg)

This function also takes care of any missing values. The **sort** command
would sort any (numeric) missing values to the end of the dataset, so that
**gen rank = _n** would then assign them the highest ranks. Usually,
however, a missing value should be matched by a missing rank, and this is
what is done by **egen, rank()**. If you are curious to see how the
reassignment for ties is done, have a look inside the code with your
favorite text editor (Stata’s own
**doedit** would be
fine). Typing

. which _grank

will tell you where the code is on your system. (The prefix **_g** is
generic to all **egen** functions.)

**egen**’s **rank()** function allows separate calculations of
ranks for each of several groups defined by a classifying variable:

. bysort foreign: egen rank = rank(mpg)

For more on **bysort** or, more generally, on
**by**, see the online
manual entry or the tutorial in Cox (2002).

Yet another nuance is being able to rank in reverse. Suppose we preferred to
assign the highest (and best) value of **mpg** rank 1. To do this, we
exploit the fact that **egen, rank()** feeds on an expression (denoted
*exp* in the **egen** syntax diagram), which can be more complicated
than a variable name, so we can work with (in particular) negated values.
Thus,

. egen rank = rank(-mpg)

would reverse the ranks for us. Finally, before leaving ranks as such, note
that **egen, rank()** also has **field**, **track**, and
**unique** options. The name **track** was suggested by track events
such as running in which not only does the lowest time win but two values
that tie for first would also both be ranked first equal. (In sports, no
one talks about rank 1.5.) Similarly, the name **field** was suggested
by field events such as jumping or throwing in which the greatest distance
or height wins, and there is a corresponding rule for ties. **unique**
arbitrarily assigns unique ranks to each of several tied values and is of
less concern here.

Coming now to the heart of the matter, consider a sample of seven values, here ordered for convenience:

0 0.57722 1 1.61803 2.71828 3.14159 10

Evidently, the median, the middle of the ordered values, is 1.61803 and
should be assigned a percentile rank of 50%, or a plotting position of 0.5,
as the value halfway through the distribution. The ranks corresponding to
these values are clearly, with the default Stata (and statistical)
convention of lowest first, 1 2 3 4 5 6 7. Using **i** and **n** to
denote rank and number of values, a fraction of **i**/**n** would
produce plotting positions for this sample of 1/7 ... 7/7, but 4/7 for the
middle value. Nor does such a rule treat the tails symmetrically.
Similarly, a rule of (**i** − 1)/**n** would produce plotting
positions 0/7 ... 6/7, but 3/7 for the middle value, and does not treat the
tails symmetrically either. From this, one obvious compromise is to split
the difference at (**i** − 0.5)/**n**.
This rule was used by the British scientist Francis Galton
(1822–1911), famous for many contributions to statistics, most
notably his work on correlation and regression. In 1914 it was
introduced into statistical hydrology by Allen Hazen (1869–1930), an
American civil engineer in private practice, otherwise best known for
his work on urban water supplies.

There is a little literature on choice of plotting positions, focusing
variously on their use for estimation of parameters or on testing of
hypotheses given specific distributions. Many of the positions commonly
recommended are members of a family (**i** − **a**)/(**n**
− 2**a** + 1). Thus, **a** = 0.5 yields Hazen’s rule;
**a** = 0.375 is a rule suggested by Gunnar Blom (1920–2003),
especially for normal probability plots; **a** = 0 is a rule advocated by
Waloddi Weibull (1887–1979) and Emil J. Gumbel (1891–1966). All
of these rules yield 0.5 for the single middle value when the sample size is
odd. Hazen’s rule is wired into the official Stata command
**quantile**, whereas
the Weibull–Gumbel rule is wired into the official Stata commands
**pnorm**,
**qnorm**,
**pchi**, and
**qchi**. For
more discussion, see Barnett (1975), Cunnane (1978), or Harter (1984).

Another rule, which for example is often used in spreadsheet software, is
**a** = 1, yielding (**i** − 1)/(**n** − 1).
This also yields 0.5 for the
single middle value when sample size is odd and treats the tails
symmetrically. However, the results of 0 and 1 for sample extremes may not
be universally suitable. In particular, for probability plots for the
normal and other distributions on the whole of the real line quantile
functions (inverse distribution functions) do not have finite values for
arguments of 0 and 1; thus, sample extremes are not plottable with this
rule. This rule in practice is often used with the **track** rule for
ties.

The choice of precise rule, from the possibilities mentioned or from any
others that may appeal, is up to the user. All that remains is to calculate
the sample size, **n**. In the simplest circumstance, this is just the
built-in **_N**. To take proper account of missing values, and to be able
to work with multiple groups, it is better to use **egen, count()** to
count the numbers of (nonmissing) values.

Suppose that we want Hazen plotting positions for *varname*:

. egen n = count(varname) . egen i = rank(varname) . gen hazen = (i - 0.5) / n

Or, we want Weibull plotting positions, but separately by *byvar*:

. bybyvar, sort: egen n = count(varname) . bybyvar: egen i = rank(varname) . gen weibull = i / (n + 1)

Or, we want spreadsheet-style percent rank:

. bybyvar, sort: egen n = count(varname) . bybyvar: egen i = rank(varname), track . gen pcrank = (i - 1) / (n - 1)

All of these examples return results between 0 and 1; for percentages, just multiply by 100.

In sum, despite a mass of small details to be thought about, these
calculations boil down to the application of two **egen** functions.

- Barnett, V. 1975.
- Probability plotting methods and order statistics.
*Applied Statistics*24: 95–108.

- Cox, N. J. 2002.
- Speaking Stata: how to move step by: step.
*Stata Journal*2: 86–102.

- Cunnane, C. 1978.
- Unbiased plotting positions—a review.
*Journal of Hydrology*37: 205–222.

- Galton, F. 1883.
*Inquiries into Human Faculty and its Development*. London: Macmillan.

- Galton, F. 1907.
- Grades and deviates.
*Biometrika*5: 400–406.

- Harter, H. L. 1984.
- Another look at plotting positions.
*Communications in Statistics, Theory and Methods*13: 1613–1633.