How can I best get box plots on logarithmic scales?
Title: Box plots and logarithmic scales
Author: Nicholas J. Cox, Durham University, UK
Date: September 2005
The problem
The purpose of this FAQ is to point out a potential pitfall with
graph box and graph hbox and to explain a way around it. Sometimes
users fire up a box plot in Stata, realize that a logarithmic scale
would be better for their data, and then ask for one
with the option yscale(log) (with either graph box or graph hbox).
(From now on examples will be just in terms of graph box, as the principle is
the same for both.)
Although Stata will let you do this, you should be aware of what option
yscale(log) actually does. As with all other graphs,
yscale(log) takes the graph you would have gotten otherwise and warps it
logarithmically.
What it does not do is recalculate summaries on the log scale, which, with
a box plot, is what you might want. However, making yscale(log) have a
special meaning for box plots would be bad software design, whatever the
statistical arguments, so if the pitfall to be discussed here matters to you,
then you will need to work your way around it.
In what follows, I assume that each variable to be shown on a box plot is all
positive, because logarithmic transformation is not defined otherwise. As you
may have noticed, if you ask graph to use yscale(log) when
zero or negative values are present, it just gives you a ridiculous graph,
rather like the kind of teacher who will not say, "That was a stupid thing to
ask," but will just give you a funny look that clearly means, "Do think about
that a bit more."
Methods for box plots differ by book and by program.
Frigge, Hoaglin, and Iglewicz (1989) cataloged several variants, and no
doubt a careful search would reveal some they missed and others that
have arisen since. Stata follows what Tukey (1977) settled on after
trying various possibilities. The most important detail is that a data
point is plotted separately if it lies more than 1.5 times the
interquartile range away from the nearer quartile. This calculation
depends on the scale being used: if you redo it on a logarithmic scale,
you will often get a different decision. Points declared as deserving
separate plotting on the original scale may not be so declared on the
logarithmic scale. Thus some high values plotted separately may jump
back inside the main box-and-whiskers cluster. Conversely, some low
values may jump out of that cluster and now be declared as deserving
separate plotting. The reclassifications reflect that the
interquartile range of logarithms is not, in general, the logarithm of the
interquartile range.
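To see how much the classification can change, the 1.5 times interquartile range rule can be applied by hand on both scales. What follows is only a sketch: it uses the quartiles reported by summarize, detail, which are assumed here to match those used by graph box, and the local macro names are arbitrary.

```stata
sysuse auto, clear
generate double log10price = log10(price)

* Apply the 1.5 * IQR rule on each scale and count the points
* that would be plotted separately
foreach v of varlist price log10price {
    quietly summarize `v', detail
    local iqr = r(p75) - r(p25)
    local upper = r(p75) + 1.5 * `iqr'
    local lower = r(p25) - 1.5 * `iqr'
    quietly count if `v' > `upper' & !missing(`v')
    local nhigh = r(N)
    quietly count if `v' < `lower'
    local nlow = r(N)
    display "`v': " `nhigh' " high and " `nlow' " low values beyond the fences"
}
```

If the two lines of output differ, some points have changed status between the original and the logarithmic scale.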
The same issue can affect, although usually not as much, the
calculated median and quartiles. Each can be based on interpolation
between data points, and so it is not always true that, say, the median
of the logarithms is exactly the same as the logarithm of the median.
Unless your dataset is strange and small, you would not
usually be troubled by the difference, but only the minimum and
maximum are guaranteed to be unaffected: they are always actual data
values, and taking logarithms preserves their order.
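A four-value example, with data made up purely for illustration, shows the interpolation effect. With an even number of observations, the median is the mean of the two middle values, so the two routes give different answers.

```stata
* Made-up data: four positive values spanning several orders of magnitude
clear
input double x
1
10
100
1000
end
generate double log10x = log10(x)

* Median on the original scale is (10 + 100) / 2 = 55; its log10 is about 1.74
quietly summarize x, detail
display "log10 of median = " log10(r(p50))

* Median of the logarithms is (1 + 2) / 2 = 1.5
quietly summarize log10x, detail
display "median of log10 = " r(p50)
```

With an odd number of observations, the median is itself a data value and the two results agree.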
The solution
The way to do it properly is thus to take logarithms first. For example,
. sysuse auto, clear
. clonevar log10price = price
. replace log10price = log10(price)
. graph box log10price
The reason for starting with
clonevar
is that you pick up the variable label. log10() has the marginal
advantage over log() (or ln()) that you can calculate the
inverse, the power of 10, more easily. This method can prove useful when you
want to add more intelligible axis labels to the graph. Seeing the graph
above, you can think that 4 means 10^4 or 10,000, leading to an extra
specification:
. graph box log10price, ylabel(4 "10000")
However, here extra labels such as 3 "1000" 5 "100000"
are too far outside the range of the data. Few of us can recall more
than the integer powers of 10, but Stata can do the calculation on the
fly. To add labels at the equivalents of 5,000 and 15,000, type
. graph box log10price, ///
ylabel(`=log10(5000)' "5000" 4 "10000" `=log10(15000)' "15000")
The help for this trick is at
help macro.
However, having to spell out several label specifications in this way is at
best a little tedious. To get the best of all worlds, you will want nice
axis labels on the original scale, but with Stata doing all the work that you would
rather not do. One way of getting those is through the program mylabels
from SSC.
You just say what you want shown and specify the scale in use. For
example, after
. mylabels 3000(2000)7000 9000(3000)15000, myscale(log10(@)) local(labels)
Stata echoes
3.47712 "3000" 3.69897 "5000" 3.8451 "7000" 3.95424 "9000" 4.07918
"12000" 4.17609 "15000"
You should not retype that, or even copy and paste, because it is tucked
safe inside the local macro specified.
. graph box log10price, ylabel(`labels', angle(h))
It may take a few iterations to get it right, but simply
reissue mylabels until you do. Use the same local macro name.
Stata is happy to overwrite it, as local macros are expendable.
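If you would rather not install anything, a similar list of labels can be built with a loop. This is a sketch of an alternative, not a description of what mylabels does internally; the macro names labels and pos are arbitrary.

```stata
sysuse auto, clear
clonevar log10price = price
replace log10price = log10(price)

* Build pairs of (position on the log10 scale, text on the original scale)
local labels
foreach v of numlist 3000(2000)7000 9000(3000)15000 {
    local pos = log10(`v')
    local labels `labels' `pos' "`v'"
}

graph box log10price, ylabel(`labels', angle(h))
```

Because local macros disappear when a do-file or interactive session ends, the loop and the graph command should be run together.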
Other nonlinear scales
The same issue with box plots and change of scale arises with any
nonlinear transformation. The calculation of median and quartiles and
the selection of data points for separate plotting need to be done
afresh on any new scale. For example, psychologists and others work with
times taken by test subjects to complete a task. The distributions
are often highly skewed, and subjects who do not complete the task should be
assigned missing values. The reciprocal of time is a speed;
missing times can then be recoded as zero speeds. Here again you would need
to do the transformation yourself and possibly fix the axis labels, too.
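A minimal sketch of the reciprocal case follows. The data and the variable names time and speed are invented for illustration, and recoding every missing time to zero assumes that each missing value really is a failure to complete.

```stata
* Made-up completion times in seconds; missing means the task was not completed
clear
input double time
12
15
20
32
75
.
.
end

* Transform first, so that the box plot summaries are calculated on the new scale
generate double speed = 1 / time
replace speed = 0 if missing(time)

* Label the speed axis at positions corresponding to chosen times
graph box speed, ///
    ylabel(0 "not completed" `=1/60' "60" `=1/30' "30" `=1/15' "15", angle(h)) ///
    ytitle("Time to complete (seconds)")
```

Note that on this scale longer times sit lower on the axis, because larger times correspond to smaller speeds.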
References
- Frigge, M., D. C. Hoaglin, and B. Iglewicz. 1989. Some implementations of the box plot. American Statistician 43: 50–54.
- Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley.