How can I best get box plots on logarithmic scales?
Title: Box plots and logarithmic scales
Author: Nicholas J. Cox, Durham University, UK
Date: September 2005
The problem
The purpose of this FAQ is to point out a potential pitfall with graph box
and graph hbox and to explain a way around it. Sometimes users fire up a box
plot in Stata, realize that a logarithmic scale would be better for their
data, and then ask for that by yscale(log) (with either graph box or
graph hbox). (From now on examples will be just in terms of graph box, as the
principle is the same for both.)
Although Stata will let you do this, you should be aware of what option
yscale(log) actually does. As with all other graphs,
yscale(log) takes the graph you would have gotten otherwise and warps
it logarithmically. What it does not do is recalculate summaries on the log
scale, which, with a box plot, is what you might want. However, making
yscale(log) have a special meaning for box plots would be bad
software design, whatever the statistical arguments, so if the pitfall to be
discussed here matters to you, then you will need to work your way around
it.
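To see the pitfall for yourself, one possibility is to draw both versions
side by side. This is only a sketch: it assumes the auto dataset shipped with
Stata, and the variable name logprice and the graph names warped and relogged
are invented here for illustration.
. sysuse auto, clear
. * box plot computed on the raw scale, then warped by yscale(log)
. graph box price, yscale(log) name(warped, replace)
. * box plot computed afresh on the logarithmic scale
. generate double logprice = log10(price)
. graph box logprice, name(relogged, replace)
. graph combine warped relogged
Any difference in which points are plotted separately shows that the two
graphs are not just relabeled versions of each other.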
In what follows, I assume that each variable to be shown on a box plot is
all positive, because logarithmic transformation is not defined otherwise.
As you may have noticed, if you ask graph to use yscale(log)
when zero or negative values are present, it just gives you a ridiculous
graph, rather like the kind of teacher who will not say, “That was a
stupid thing to ask,” but will just give you a funny look that clearly
means, “Do think about that a bit more.”
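A quick check before transforming can save that funny look. As a minimal
sketch, assuming the auto dataset and its variable price, you could count
offending values and then insist that there are none:
. sysuse auto, clear
. * how many values are zero or negative?
. count if price <= 0
. * stop with an error unless every nonmissing value is positive
. assert price > 0 if !missing(price)
If assert says nothing, the condition holds for every observation.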
Methods for box plots differ by book and by program. Frigge, Hoaglin, and
Iglewicz (1989) cataloged several variants, and no doubt a careful search
would reveal some they missed and others that have arisen since. Stata
follows what Tukey (1977) settled on after trying various possibilities. The
most important detail is that a data point is plotted separately if it lies
more than 1.5 times the interquartile range away from the nearer quartile.
This calculation depends on the scale being used: if you redo it on a
logarithmic scale, you will often get a different decision. Points declared
as deserving separate plotting on the original scale may not be so declared
on the logarithmic scale. Thus some high values plotted separately may jump
back inside the main box-and-whiskers cluster. Conversely, some low values
may jump out of that cluster and now be declared as deserving separate
plotting. The reclassifications reflect that the interquartile range of
logarithms is not, in general, the logarithm of the interquartile range.
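The 1.5 times interquartile range rule can be applied by hand on each scale
to see how many points change status. What follows is a sketch, not a
definitive recipe: it assumes the auto dataset, that the quartiles from
summarize, detail match those used by graph box, and names (logprice and the
four scalars) made up for this example.
. sysuse auto, clear
. generate double logprice = log10(price)
. * limits on the original scale: quartiles -/+ 1.5 * IQR
. quietly summarize price, detail
. scalar lo_raw = r(p25) - 1.5 * (r(p75) - r(p25))
. scalar hi_raw = r(p75) + 1.5 * (r(p75) - r(p25))
. count if (price < lo_raw | price > hi_raw) & !missing(price)
. * the same rule applied afresh on the logarithmic scale
. quietly summarize logprice, detail
. scalar lo_log = r(p25) - 1.5 * (r(p75) - r(p25))
. scalar hi_log = r(p75) + 1.5 * (r(p75) - r(p25))
. count if (logprice < lo_log | logprice > hi_log) & !missing(logprice)
. * upper limits compared on the original scale
. display hi_raw "  " 10^hi_log
If the two counts differ, some points have been reclassified.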
The same issue can affect, although usually not as much, the calculated
median and quartiles. Each can be based on interpolation between data
points, and so it is not always true that, say, the median of the logarithms
is exactly the same as the logarithm of the median. Unless your dataset is
strange and small, you would not usually be troubled by the difference, but
only the minimum and maximum are guaranteed to be unaffected.
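A deliberately strange and small dataset makes the point about interpolation.
The sketch below builds two hypothetical observations, 1 and 100, with
variable names x and log10x invented for the example. The median of x is the
average of the two values, 50.5, whereas the median of the logarithms is the
average of 0 and 2.
. clear
. set obs 2
. generate x = cond(_n == 1, 1, 100)
. generate log10x = log10(x)
. * logarithm of the median
. quietly summarize x, detail
. display log10(r(p50))
. * median of the logarithms
. quietly summarize log10x, detail
. display r(p50)
The first result is a little above 1.7 and the second is 1, which
corresponds to 10 on the original scale.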
The solution
The way to do it properly is thus to take logarithms first. For example,
. sysuse auto, clear
. clonevar log10price = price
. replace log10price = log10(price)
. graph box log10price
The reason for starting with clonevar is that you pick up the variable
label. log10() has the marginal advantage over log() (or ln()) that you can
calculate the inverse, the power of 10, more easily. This method can prove
useful when you want to add more intelligible axis labels to the graph.
Seeing the graph above, you can think that 4 means 10^4 or 10,000, leading
to an extra specification:
. graph box log10price, ylabel(4 "10000")
However, here extra labels such as 3 “1000” 5 “100000” are too far outside
the range of the data. Few of us can recall logarithms other than those of
integer powers of 10, but Stata can do the calculation on the fly. To add
labels at the equivalents of 5,000 and 15,000, type
. graph box log10price, ///
ylabel(`=log10(5000)' "5000" 4 "10000" `=log10(15000)' "15000")
The help for this trick, evaluating an expression on the fly with `=exp',
is at help macro.
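If you would like to see what the on-the-fly evaluation produces before
putting it in a graph command, display is a convenient way to look. This is
just an illustrative check, using the same values as above:
. * the value Stata computes for the label position
. display log10(5000)
. * the same expression expanded inside a macro reference
. display "`=log10(5000)'"
. * and back again to the original scale
. display 10^`=log10(15000)'
Note also that the /// continuation used above works in do-files and
programs but not in the Command window, where the command should be typed on
one line.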
However, having to spell out several label specifications in this way is at
best a little tedious. To get the best of all worlds, you will want nice
axis labels on the original scale, but with Stata doing all the work that
you would rather not do. One way of getting those is through the program
mylabels from SSC.
You just say what you want shown and specify the scale in use. For example,
after
. mylabels 3000(2000)7000 9000(3000)15000, myscale(log10(@)) local(labels)
Stata echoes
3.47712 "3000" 3.69897 "5000" 3.8451 "7000" 3.95424 "9000" 4.07918
"12000" 4.17609 "15000"
You do not need to retype that, or even copy and paste it, because it is
stored safely in the local macro you specified (here, labels).
. graph box log10price, ylabel(`labels', angle(h))
It may take a few iterations to get it right, but simply reissue
mylabels until you do. Use the same local macro name. Stata is happy
to overwrite it, as local macros are expendable.
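If you would rather not install anything, a similar label specification can
be assembled with a loop. This is a sketch of one alternative, not a
description of how mylabels works internally; it assumes the auto dataset,
and for simplicity it creates log10price with generate rather than clonevar.
. sysuse auto, clear
. generate double log10price = log10(price)
. * start with an empty macro, then append: position "text" for each value
. local labels
. foreach v of numlist 3000(2000)7000 9000(3000)15000 {
      local labels `labels' `=log10(`v')' "`v'"
  }
. graph box log10price, ylabel(`labels', angle(h))
Each pass through the loop adds the logarithm of a value as the label
position, followed by the value itself in quotes as the text to show.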
Other nonlinear scales
The same issue with box plots and change of scale arises with any nonlinear
transformation. The calculation of median and quartiles and the selection of
data points for separate plotting need to be done afresh on any new scale.
For example, psychologists and others work with times taken by test subjects
to complete a task. The distributions are often highly skewed and subjects
who do not complete should be assigned missing values. The reciprocal of
time is a speed; missing times can be recoded as zero speeds. Here again
you would need to do the transformation yourself and possibly fix the axis
labels, too.
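As a sketch of that reciprocal example, suppose a variable time holds
completion times, with missing for subjects who did not complete. The data
below are made up purely for illustration, as are the variable names.
. clear
. set obs 6
. * hypothetical times; the sixth subject did not complete
. generate time = cond(_n == 6, ., 2 * _n)
. * speed is the reciprocal of time; noncompleters get zero speed
. generate speed = cond(missing(time), 0, 1 / time)
. graph box speed
Recoding to zero is appropriate only if every missing time really means
failure to complete, rather than, say, a lost record.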
References
- Frigge, M., D. C. Hoaglin, and B. Iglewicz. 1989. Some implementations of the box plot. American Statistician 43: 50–54.
- Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley.