|Title||Box plots and logarithmic scales|
|Author||Nicholas J. Cox, Durham University, UK|
|Date||September 2005; minor revisions March 2014|
The purpose of this FAQ is to point out a potential pitfall with graph box and graph hbox and to explain a way around it. Sometimes users fire up a box plot in Stata, realize that a logarithmic scale would be better for their data, and then ask for that by yscale(log) (with either graph box or graph hbox). (From now on examples will be just in terms of graph box, as the principle is the same for both.)
Although Stata will let you do this, you should be aware of what option yscale(log) actually does. As with all other graphs, yscale(log) takes the graph you would have gotten otherwise and warps it logarithmically. What it does not do is recalculate summaries on the log scale, which, with a box plot, is what you might want. However, making yscale(log) have a special meaning for box plots would be bad software design, whatever the statistical arguments, so if the pitfall to be discussed here matters to you, then you will need to work your way around it.
In what follows, I assume that each variable to be shown on a box plot is all positive, because logarithmic transformation is not defined otherwise. As you may have noticed, if you ask graph to use yscale(log) when zero or negative values are present, it just gives you a ridiculous graph, rather like the kind of teacher who will not say, “That was a stupid thing to ask,” but will just give you a funny look that clearly means, “Do think about that a bit more.”
Methods for box plots differ by book and by program. Frigge, Hoaglin, and Iglewicz (1989) cataloged several variants, and no doubt a careful search would reveal some they missed and others that have arisen since. Stata follows what Tukey (1977) settled on after trying various possibilities. The most important detail is that a data point is plotted separately if it lies more than 1.5 times the interquartile range away from the nearer quartile. This calculation depends on the scale being used: if you redo it on a logarithmic scale, you will often get a different decision. Points declared as deserving separate plotting on the original scale may not be so declared on the logarithmic scale. Thus some high values plotted separately may jump back inside the main box-and-whiskers cluster. Conversely, some low values may jump out of that cluster and now be declared as deserving separate plotting. The reclassifications reflect that the interquartile range of logarithms is not, in general, the logarithm of the interquartile range.
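The scale dependence of the 1.5 times interquartile range rule can be checked directly. The following is a minimal sketch, assuming the auto dataset shipped with Stata and its price variable (an illustrative choice), and assuming that the quartiles reported by summarize, detail are an adequate stand-in for those used by graph box. The names log10price, upper_raw, and upper_log are hypothetical.

```stata
* Illustrative data: any all-positive variable would do
sysuse auto, clear
generate log10price = log10(price)

* Upper cutoff on the original scale: upper quartile + 1.5 * IQR
summarize price, detail
scalar upper_raw = r(p75) + 1.5 * (r(p75) - r(p25))
count if price > scalar(upper_raw) & !missing(price)

* The same rule applied afresh on the log10 scale
summarize log10price, detail
scalar upper_log = r(p75) + 1.5 * (r(p75) - r(p25))
count if log10price > scalar(upper_log) & !missing(log10price)

* Back-transform the log-scale cutoff for comparison with upper_raw
display 10^scalar(upper_log)
```

If the two counts differ, some points plotted separately on one scale would not be plotted separately on the other.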
The same issue can also affect the calculated median and quartiles, although usually to a lesser extent. Each can be based on interpolation between data points, so it is not always true that, say, the median of the logarithms is exactly the logarithm of the median. Unless your dataset is strange and small, you would not usually be troubled by the difference. Only the minimum and maximum are guaranteed never to show even a small discrepancy, because they are always actual data values and taking logarithms preserves their order.
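A deliberately small and strange dataset makes the point about interpolation concrete. This sketch uses four made-up values (purely illustrative) and the hypothetical variable names y and logy. With an even number of observations, the median is the average of the two middle values, so it is not itself a data point.

```stata
clear
input y
1
10
100
1000
end
generate logy = log10(y)

* Median on the original scale is (10 + 100)/2 = 55
summarize y, detail
display "log10 of the median:      " log10(r(p50))

* Median on the log scale is (1 + 2)/2 = 1.5
summarize logy, detail
display "median of the logarithms: " r(p50)
```

Here the logarithm of the median is about 1.74, whereas the median of the logarithms is 1.5.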
The way to do it properly is thus to take logarithms first. For example,
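A minimal sketch of that approach, assuming the auto dataset shipped with Stata and its price variable (an illustrative choice; the variable name log10price is likewise hypothetical):

```stata
sysuse auto, clear

* clonevar copies the variable together with its variable label
clonevar log10price = price

* Overwrite the copy with base-10 logarithms; Stata promotes the
* storage type from integer to float as needed
replace log10price = log10(price)

* Medians, quartiles, and outside values are now computed on the log scale
graph box log10price
```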
The reason for starting with clonevar is that you pick up the variable label. log10() has the marginal advantage over log() (or ln()) that you can calculate the inverse, the power of 10, more easily. This method can prove useful when you want to add more intelligible axis labels to the graph. Seeing the graph above, you can think that 4 means 10^4 or 10,000, leading to an extra specification:
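A sketch of such a specification, assuming the logged variable is named log10price (a hypothetical name) and was built from price in the auto dataset:

```stata
sysuse auto, clear
clonevar log10price = price
replace log10price = log10(price)

* Show the text "10000" at position 4 on the log10 axis
graph box log10price, ylabel(4 "10000")
```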
However, here extra labels such as 3 “1000” 5 “100000” are too far outside the range of the data. Few of us can recall base-10 logarithms other than those of the integer powers of 10, but Stata can do the calculation on the fly. To add labels at the equivalents of 5,000 and 15,000, type
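One way to write this, again assuming the hypothetical variable log10price built from price in the auto dataset. The construct `=exp' asks Stata to evaluate the expression and substitute the result before the command runs.

```stata
sysuse auto, clear
clonevar log10price = price
replace log10price = log10(price)

* Each `=log10(#)' is replaced by its numeric value before graph box runs
graph box log10price, ///
    ylabel(`=log10(5000)' "5000" 4 "10000" `=log10(15000)' "15000")
```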
The help for this trick is at [P] macro. However, having to spell out several label specifications in this way is at best a little tedious. Ideally, you want nice axis labels on the original scale, with Stata doing all the work that you would rather not do. One way of getting those is through the program mylabels from SSC. You say which labels you want shown, specify the scale in use, and name a local macro to hold the result. For example, after asking mylabels for labels at 3000, 5000, 7000, 9000, 12000, and 15000 on a log10 scale, the local macro contains
3.47712 "3000" 3.69897 "5000" 3.8451 "7000" 3.95424 "9000" 4.07918 "12000" 4.17609 "15000"
You need not retype that, or even copy and paste, because it is tucked safe inside the local macro specified. You simply refer to that macro when specifying the axis labels.
It may take a few iterations to get it right, but simply reissue mylabels until you do. Use the same local macro name. Stata is happy to overwrite it, as local macros are expendable.
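The mylabels workflow described above can be sketched as follows. It assumes that mylabels has been installed from SSC, that the data are the auto dataset with the hypothetical logged variable log10price, and that the local macro is named labels (any name would do).

```stata
* One-time installation from SSC (skip if already installed)
ssc install mylabels

sysuse auto, clear
clonevar log10price = price
replace log10price = log10(price)

* @ stands for each requested value; results are stored in local macro labels
mylabels 3000 5000 7000 9000 12000 15000, myscale(log10(@)) local(labels)

* Use the stored specification without retyping it
graph box log10price, ylabel(`labels')
```

Because local macros vanish when a do-file or interactive session ends, the mylabels call and the graph command should be run together.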
The same issue with box plots and change of scale arises with any nonlinear transformation. The calculation of median and quartiles and the selection of data points for separate plotting need to be done afresh on any new scale. For example, psychologists and others work with times taken by test subjects to complete a task. The distributions are often highly skewed and subjects who do not complete should be assigned missing values. The reciprocal of time is a speed; missing times can be recoded as zero speeds. Here again you would need to do the transformation yourself and possibly fix the axis labels, too.
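The reciprocal example might look like this. The data are a handful of made-up completion times (purely illustrative), and the variable names time and speed are hypothetical.

```stata
clear
input time
12
15
20
45
.
end

* Reciprocal transformation: time taken becomes speed
generate speed = 1 / time

* Subjects who did not complete (missing time) get zero speed
replace speed = 0 if missing(time)

* Median, quartiles, and outside values are computed on the speed scale
graph box speed
```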
For broader discussion of box plots within Stata, including how to create your own variants on the default design, see Cox (2009, 2013).