How can I best get box plots on logarithmic scales?

Title   Box plots and logarithmic scales
Author Nicholas J. Cox, Durham University, UK

The problem

The purpose of this FAQ is to point out a potential pitfall with graph box and graph hbox and to explain a way around it. Sometimes, users fire up a box plot in Stata, realize that a logarithmic scale would be better for their data, and then ask for that by yscale(log) (with either graph box or graph hbox). (From now on, examples will be just in terms of graph box, as the principle is the same for both.)

Although Stata will let you do this, you should be aware of what option yscale(log) actually does. As with all other graphs, yscale(log) takes the graph you would have gotten otherwise and warps it logarithmically. What it does not do is recalculate summaries on the log scale, which, with a box plot, is what you might want. However, making yscale(log) have a special meaning for box plots would be bad software design, whatever the statistical arguments, so if the pitfall to be discussed here matters to you, then you will need to work your way around it.

In what follows, I assume that each variable to be shown on a box plot is all positive, because logarithmic transformation is not defined otherwise. As you may have noticed, if you ask graph to use yscale(log) when zero or negative values are present, it just gives you a ridiculous graph, rather like the kind of teacher who will not say, “That was a stupid thing to ask,” but will just give you a funny look that clearly means, “Do think about that a bit more.”

Methods for box plots differ by book and by program. Frigge, Hoaglin, and Iglewicz (1989) cataloged several variants, and no doubt a careful search would reveal some they missed and others that have arisen since. Stata follows what Tukey (1977) settled on after trying various possibilities. The most important detail is that a data point is plotted separately if it lies more than 1.5 times the interquartile range away from the nearer quartile. This calculation depends on the scale being used: if you redo it on a logarithmic scale, you will often get a different decision. Points declared as deserving separate plotting on the original scale may not be so declared on the logarithmic scale. Thus, some high values plotted separately may jump back inside the main box-and-whiskers cluster. Conversely, some low values may jump out of that cluster and now be declared as deserving separate plotting. The reclassifications reflect that the interquartile range of logarithms is not generally the logarithm of the interquartile range.
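To see the reclassification in a concrete case, you can calculate the limits at 1.5 times the interquartile range on both scales and count the points beyond them. This is only a sketch: it assumes that the quartiles reported by summarize, detail agree with those used by graph box, and the variable name log10p and the scalar names are hypothetical, chosen for illustration.

. sysuse auto, clear
. generate double log10p = log10(price)
. quietly summarize price, detail
. scalar hi_raw = r(p75) + 1.5*(r(p75) - r(p25))
. scalar lo_raw = r(p25) - 1.5*(r(p75) - r(p25))
. quietly summarize log10p, detail
. scalar hi_log = r(p75) + 1.5*(r(p75) - r(p25))
. scalar lo_log = r(p25) - 1.5*(r(p75) - r(p25))
 *Points beyond the limits calculated on the original scale
. count if price > scalar(hi_raw) | price < scalar(lo_raw)
 *Points beyond the limits calculated on the logarithmic scale
. count if log10p > scalar(hi_log) | log10p < scalar(lo_log)

If the two counts differ, then some points would be plotted separately on one scale but not on the other.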

The same issue can affect the calculated median and quartiles, although usually not as much. Each can be based on interpolation between data points, so it is not always true that, say, the median of the logarithms is exactly the logarithm of the median. Unless your dataset is strange and small, you would not usually be troubled by the difference, but only the minimum and maximum are guaranteed to be free of even this small problem.
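A deliberately tiny example shows the interpolation effect. The data are hypothetical: four values (1, 10, 100, and 1000), chosen so that the median falls halfway between two data points.

. clear
. set obs 4
. generate y = 10^(_n - 1)
. generate log10y = log10(y)
 *Logarithm of the median
. quietly summarize y, detail
. display log10(r(p50))
 *Median of the logarithms
. quietly summarize log10y, detail
. display r(p50)

The median of y is 55, whose base 10 logarithm is about 1.74, whereas the median of log10y is 1.5.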

The solution

The way to do it properly is thus to take logarithms first. For example,

. sysuse auto, clear
. clonevar log10price = price
. replace log10price = log10(price) 
. graph box log10price

Graph 1

The reason for starting with clonevar is that you pick up the variable label. log10() has the marginal advantage over log() (or ln()) that you can calculate the inverse, the power of 10, more easily. This method can prove useful when you want to add more intelligible axis labels to the graph. Seeing the graph above, you can think that 4 means 10^4 or 10,000, leading to an extra specification:

. graph box log10price, ylabel(4 "10000")

However, here extra labels such as 3 “1000” 5 “100000” are too far outside the range of the data. Few of us can recall base 10 logarithms other than those of the integer powers of 10, but Stata can do the calculation on the fly. To add labels at the equivalents of 5,000 and 15,000, type

. graph box log10price, ylabel(`=log10(5000)' "5000" 4 "10000" `=log10(15000)' "15000")

The help for this trick is at [P] macro. However, having to spell out several label specifications in this way is at best a little tedious. To get the best of all worlds, you will want nice axis labels on the original scale but with Stata doing all the work that you would rather not do. One way of getting those is through the program mylabels from SSC. You just say what you want shown and specify the scale in use. For example, after

. mylabels 3000(2000)7000 9000(3000)15000, myscale(log10(@)) local(labels)

Stata echoes

 3.47712 "3000" 3.69897 "5000" 3.8451 "7000" 3.95424 "9000" 4.07918 
 "12000" 4.17609 "15000"

You should not retype that, or even copy and paste it, because it is tucked safely inside the local macro you specified, which you can now use directly:

. graph box log10price, ylabel(`labels', angle(h))

Graph 2

It may take a few iterations to get it right, but simply reissue mylabels until you do. Use the same local macro name. Stata is happy to overwrite it because local macros are expendable.
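If you would rather not install anything, much the same label list can be assembled with a loop. This is a sketch of an alternative technique, not a description of how mylabels works internally. It reuses the macro name labels and assumes that log10price exists as created above; the loop variable v is arbitrary.

. local labels
. foreach v of numlist 3000(2000)7000 9000(3000)15000 {
.     local labels `labels' `=log10(`v')' "`v'"
. }
. graph box log10price, ylabel(`labels', angle(h))

Each pass through the loop appends one position on the log10 scale followed by its label on the original scale.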

Update October 2023

See the community-contributed command box_logscale by Mark Chatfield, posted in SSC in October 2023. This command implements the advice described above.

. ssc install box_logscale
 *Generate a dataset with a lognormal variable
. clear all
. set seed 999
. set obs 999
. gen y = 10^rnormal(0,0.3)
 *Create a box plot in two different ways
. graph box y, yscale(log)
. box_logscale y  //  new community-contributed command

Graphs (left: graph box y, yscale(log); right: box_logscale y)

Left graph: There are many high outside values and no low outside values, and the numeric axis is badly labeled.

Right graph: There are just a few high outside values and just a few low outside values, and the numeric axis is nicely labeled.

Notes for box_logscale (quoted from help box_logscale):

“box_logscale calculates quartiles of log10(y) in the usual manner described in [G-2] graph box. Call them q1log, q2log, q3log. Then the following are calculated: Ulog = q3log + 1.5*(q3log - q1log) and Llog = q1log - 1.5*(q3log - q1log). Then adjacent values for log10(y) are defined in the usual manner described in [G-2] graph box, i.e. the upper adjacent value is the largest value of log10(y) not exceeding Ulog, and the lower adjacent value is the smallest value of log10(y) exceeding Llog. The plot is then drawn, and labels on the numeric axis are carefully chosen such that they correspond to (nice, possibly user specified) original-scale values - I make these nice values appear on the numeric axis.

This has the effect of looking like a plot of the untransformed data, where the box shows q1 = 10^q1log, q2 = 10^q2log and q3 = 10^q3log, and whiskers are drawn between the box and adjacent values, where adjacent values are now defined as: the upper adjacent value is the largest value of y not exceeding U, and the lower adjacent value is the smallest value of y exceeding L, where U = 10^Ulog = q3* (q3/q1)^1.5, and L = 10^Llog = q1 / (q3/q1)^1.5. It could be argued this results in a generalization of Tukey's definition of whiskers.”
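The limits described in the quoted notes can be worked through by hand. The sketch below is not the code of box_logscale. It assumes that quartiles from summarize, detail agree with those used by graph box, and the variable name log10y and the scalar names are hypothetical.

. clear all
. set seed 999
. set obs 999
. generate y = 10^rnormal(0,0.3)
. generate double log10y = log10(y)
. quietly summarize log10y, detail
. scalar Ulog = r(p75) + 1.5*(r(p75) - r(p25))
. scalar Llog = r(p25) - 1.5*(r(p75) - r(p25))
 *Limits back-transformed to the original scale
. display "L = " 10^scalar(Llog) "   U = " 10^scalar(Ulog)
 *Number of values that would be plotted separately
. count if log10y < scalar(Llog) | log10y > scalar(Ulog)

Because the limits are calculated from the logarithms, they are symmetric about the box on the log scale, which is why both low and high values can fall outside them.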

Other nonlinear scales

The same issue with box plots and change of scale arises with any nonlinear transformation. The calculation of median and quartiles and the selection of data points for separate plotting need to be done afresh on any new scale. For example, psychologists and others work with times taken by test subjects to complete a task. The distributions are often highly skewed, and subjects who do not complete should be assigned missing values. The reciprocal of time is a speed; missing times can be recoded as zero speeds. Here, again, you would need to do the transformation yourself and possibly fix the axis labels too.
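A minimal sketch of the reciprocal case follows. The data are simulated and entirely hypothetical, as are the variable names time and speed; the first five subjects are treated as not completing the task.

. clear
. set seed 2023
. set obs 50
. generate time = exp(rnormal(3, 0.5))
 *Subjects who did not complete get missing times
. replace time = . in 1/5
 *Reciprocal of time is speed; missing times become zero speeds
. generate speed = cond(missing(time), 0, 1/time)
. label variable speed "Speed (1/time)"
. graph box speed

The median, quartiles, and separately plotted points are then all calculated on the speed scale, with the subjects who did not complete included at zero.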

For broader discussion of box plots within Stata, including how to create your own variants on the default design, see Cox (2009, 2013).

References

Cox, N. J. 2009.
Speaking Stata: Creating and varying box plots. Stata Journal 9: 478–496.
Cox, N. J. 2013.
Speaking Stata: Creating and varying box plots: Correction. Stata Journal 13: 398–400.
Frigge, M., D. C. Hoaglin, and B. Iglewicz. 1989.
Some implementations of the box plot. American Statistician 43: 50–54.
Tukey, J. W. 1977.
Exploratory Data Analysis. Reading, MA: Addison–Wesley.