How can I calculate percentile ranks?
How can I calculate plotting positions?
|
Title
|
|
Calculating percentile ranks or plotting positions
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
July 2002; minor revision July 2005
|
1. The problem
Some kinds of data are often reported as percentile ranks. A test score may
be reported as a percentile rank of 95% if 95% of scores are less than or
equal to that score. A newborn baby’s weight may be reported in the
same way.
The idea of a plotting position is essentially similar, except that
conventionally plotting positions are reported as proportions rather than
percentages. Their name stems from their use in probability plots. Given a
set of ordered values, proportions are assigned recording the fraction of
values at or below each value. These proportions may then be used directly
or indirectly in plots. For example, a quantile–quantile plot for
testing normality of distribution compares observed quantiles with quantiles
for a sample of the same size from a normal distribution, determined by
evaluating the quantile (inverse distribution) function for the normal at
the plotting positions.
Official Stata has no functions for these calculations, although
user-defined functions may be found. They may, however, be performed easily
using results from other functions. The lack of built-in functions has one
advantage: it obliges users to select the precise procedure to be used,
given that several slightly different rules exist.
The calculation of the empirical cumulative distribution, at least as
implemented in Stata, is a related but subtly different problem. In official
Stata, this calculation may be performed using
cumul.
cumul by default appears to be most closely geared to preparing a
plot of cumulative probabilities (y-axis) versus observed values
(x-axis). When tied values occur, cumul will not (again, by
default) assign the same cumulative probability to each tied value. This is
immaterial and indeed even desirable for graphical purposes: vertical line
segments will be produced whatever the convention about ties, and when
visible plot symbols are used, it is often helpful to get an impression of
the frequency of ties. In contrast, this FAQ is focused mainly on problems
in which identical values should all be assigned the same proportion or
percentage. Incidentally, an equal option was added to cumul
on 17 June 2003 to override the default and ensure that this condition is
satisfied.
2. Ranks
To assign ranks, we might
sort on a variable
(say, mpg from the auto dataset) and then use the observation number:
. sysuse auto
. sort mpg
. gen rank = _n
As sorting in Stata puts lowest values first, the lowest value—or more
precisely, the value sorted to the top of the dataset—would be
assigned rank 1. This is also the most common statistical convention.
However, this code is too simple to work perfectly except in the simplest
situation with no tied values and no missing values. For example, looking at
data for mpg, we see that the first two values after sorting are tied
at 12, yet these are assigned ranks 1 and 2 by the
generate command
above. In statistics it is common to assign the same rank to each of
several tied values such that the sum of the ranks is preserved. Ranks 1 and
2 give a sum of 3 to be preserved, so each would become rank 1.5 instead. We
could carry out this reassignment for all tied values ourselves, but it is
convenient that it is encapsulated in
egen’s
rank() function:
. egen rank = rank(mpg)
This function also takes care of any missing values. The sort command
would sort any (numeric) missing values to the end of the dataset, so that
gen rank = _n would then assign them the highest ranks. Usually,
however, a missing value should be matched by a missing rank, and this is
what is done by egen, rank(). If you are curious to see how the
reassignment for ties is done, have a look inside the code with your
favorite text editor (Stata’s own
doedit would be
fine). Typing
. which _grank
will tell you where the code is on your system. (The prefix _g is
generic to all egen functions.)
egen’s rank() function allows separate calculations of
ranks for each of several groups defined by a classifying variable:
. bysort foreign: egen rank = rank(mpg)
For more on bysort or, more generally, on
by, see the online
help or the tutorial in Cox (2002).
Yet another nuance is being able to rank in reverse. Suppose we preferred to
assign the highest (and best) value of mpg rank 1. To do this, we
exploit the fact that egen, rank() feeds on an expression (denoted
exp in the egen syntax diagram), which can be more complicated
than a variable name, so we can work with (in particular) negated values.
Thus,
. egen rank = rank(-mpg)
would reverse the ranks for us. Finally, before leaving ranks as such, note
that egen, rank() also has field, track, and
unique options. The name track was suggested by track events
such as running in which not only does the lowest time win but two values
that tie for first would also both be ranked first equal. (In sports, no
one talks about rank 1.5.) Similarly, the name field was suggested
by field events such as jumping or throwing in which the greatest distance
or height wins, and there is a corresponding rule for ties. unique
arbitrarily assigns unique ranks to each of several tied values and is of
less concern here.
3. Plotting positions
Coming now to the heart of the matter, consider a sample of seven values,
here ordered for convenience:
0 0.57722 1 1.61803 2.71828 3.14159 10
Evidently, the median, the middle of the ordered values, is 1.61803 and
should be assigned a percentile rank of 50%, or a plotting position of 0.5,
as the value halfway through the distribution. The ranks corresponding to
these values are clearly, with the default Stata (and statistical)
convention of lowest first, 1 2 3 4 5 6 7. Using i and n to
denote rank and number of values, a fraction of i/n would
produce plotting positions for this sample of 1/7 ... 7/7, but 4/7 for the
middle value. Nor does such a rule treat the tails symmetrically.
Similarly, a rule of (i − 1)/n would produce plotting
positions 0/7 ... 6/7, but 3/7 for the middle value, and does not treat the
tails symmetrically either. From this, one obvious compromise is to split
the difference at (i − 0.5)/n. This rule goes back as
least as far as 1914, when it was introduced into statistical hydrology by
Allen Hazen (1869–1930), an American civil engineer in private
practice, otherwise best known for his work on urban water supplies.
There is a little literature on choice of plotting positions, focusing
variously on their use for estimation of parameters or on testing of
hypotheses given specific distributions. Many of the positions commonly
recommended are members of a family (i − a)/(n
− 2a + 1). Thus, a = 0.5 yields Hazen’s rule;
a = 0.375 is a rule suggested by Gunnar Blom (1920–2003),
especially for normal probability plots; a = 0 is a rule advocated by
Waloddi Weibull (1887–1979) and Emil J. Gumbel (1891–1966). All
of these rules yield 0.5 for the single middle value when the sample size is
odd. Hazen’s rule is wired into the official Stata command
quantile, whereas
the Weibull–Gumbel rule is wired into the official Stata commands
pnorm,
qnorm,
pchi, and
qchi. For
more discussion, see Barnett (1975), Cunnane (1978), or Harter (1984).
Another rule, which for example is often used in spreadsheet software, is
(i − 1)/(n − 1). This also yields 0.5 for the
single middle value when sample size is odd and treats the tails
symmetrically. However, the results of 0 and 1 for sample extremes may not
be universally suitable. In particular, for probability plots for the
normal and other distributions on the whole of the real line quantile
functions (inverse distribution functions) do not have finite values for
arguments of 0 and 1; thus, sample extremes are not plottable with this
rule. This rule in practice is often used with the track rule for
ties.
4. Obtaining results in Stata
The choice of precise rule, from the possibilities mentioned or from any
others that may appeal, is up to the user. All that remains is to calculate
the sample size, n. In the simplest circumstance, this is just the
built-in _N. To take proper account of missing values, and to be able
to work with multiple groups, it is better to use egen, count() to
count the numbers of (nonmissing) values.
Suppose that we want Hazen plotting positions for varname:
. egen n = count(varname)
. egen i = rank(varname)
. gen hazen = (i - 0.5) / n
Or, we want Weibull plotting positions, but separately by byvar:
. by byvar, sort: egen n = count(varname)
. by byvar: egen i = rank(varname)
. gen weibull = i / (n + 1)
Or, we want spreadsheet-style percent rank:
. by byvar, sort: egen n = count(varname)
. by byvar: egen i = rank(varname), track
. gen pcrank = (i - 1) / (n - 1)
All of these examples return results between 0 and 1; for percentages, just
multiply by 100.
In sum, despite a mass of small details to be thought about, these
calculations boil down to the application of two egen functions.
References
- Barnett, V. 1975.
- Probability plotting methods and order statistics.
Applied Statistics 24: 95–108.
- Cox, N. J. 2002.
- Speaking Stata: how to move step by: step.
Stata Journal 2: 86–102.
- Cunnane, C. 1978.
- Unbiased plotting positions—a review.
Journal of Hydrology 37: 205–222.
- Harter, H. L. 1984.
- Another look at plotting positions.
Communications in Statistics, Theory and Methods 13: 1613–1633.
|
FAQs
What's new?
Statistics
Data management
Graphics
Programming Stata
Mata
Resources
Internet capabilities
Stata for Windows
Stata for Unix
Stata for Mac
Technical support
|