Title | Box plots and logarithmic scales |

Author | Nicholas J. Cox, Durham University, UK |

The purpose of this FAQ is to point out a potential pitfall with
**graph box** and
**graph hbox**
and to explain a way around it. Sometimes users fire up a box plot in Stata,
realize that a logarithmic scale would be better for their data, and then
ask for that
by **yscale(log)** (with either **graph box** or **graph hbox**).
(From now on examples will be just in terms of **graph box**, as the principle is
the same for both.)

Although Stata will let you do this, you should be aware of what option
**yscale(log)** actually does. As with all other graphs,
**yscale(log)** takes the graph you would have gotten otherwise and warps
it logarithmically. What it does not do is recalculate summaries on the log
scale, which, with a box plot, is what you might want. However, making
**yscale(log)** have a special meaning for box plots would be bad
software design, whatever the statistical arguments, so if the pitfall to be
discussed here matters to you, then you will need to work your way around
it.
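The warping can be seen by drawing the same box plot with and without the option. This is only a sketch: the auto dataset shipped with Stata, its variable price, and the graph names raw and warped are assumptions chosen for illustration, not part of the discussion above.

```stata
* Assumption: the shipped auto dataset and price stand in for your own data
sysuse auto, clear

* Box plot on the original scale
graph box price, name(raw, replace)

* Same quartiles, whiskers, and separately plotted points; only the axis is warped
graph box price, yscale(log) name(warped, replace)
```

Comparing the two graphs should show the same data points plotted separately in each, because nothing has been recalculated.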

In what follows, I assume that each variable to be shown on a box plot is
all positive, because logarithmic transformation is not defined otherwise.
As you may have noticed, if you ask **graph** to use **yscale(log)**
when zero or negative values are present, it just gives you a ridiculous
graph, rather like the kind of teacher who will not say, “That was a
stupid thing to ask,” but will just give you a funny look that clearly
means, “Do think about that a bit more.”

Methods for box plots differ by book and by program. Frigge, Hoaglin, and Iglewicz (1989) cataloged several variants, and no doubt a careful search would reveal some they missed and others that have arisen since. Stata follows what Tukey (1977) settled on after trying various possibilities. The most important detail is that a data point is plotted separately if it lies more than 1.5 times the interquartile range away from the nearer quartile. This calculation depends on the scale being used: if you redo it on a logarithmic scale, you will often get a different decision. Points declared as deserving separate plotting on the original scale may not be so declared on the logarithmic scale. Thus some high values plotted separately may jump back inside the main box-and-whiskers cluster. Conversely, some low values may jump out of that cluster and now be declared as deserving separate plotting. The reclassifications reflect that the interquartile range of logarithms is not, in general, the logarithm of the interquartile range.
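The scale dependence of the 1.5 times interquartile range rule can be checked directly. The following is a minimal sketch, assuming the auto dataset and its variable price as stand-in data, a hypothetical new variable log10price, and quartiles as reported by **summarize, detail**; it counts the points outside the fences on each scale.

```stata
* Assumption: the shipped auto dataset and price stand in for your own data
sysuse auto, clear
generate double log10price = log10(price)

* Fences at 1.5 * IQR beyond the quartiles, original scale
quietly summarize price, detail
scalar lo_raw = r(p25) - 1.5 * (r(p75) - r(p25))
scalar hi_raw = r(p75) + 1.5 * (r(p75) - r(p25))
count if (price < lo_raw | price > hi_raw) & !missing(price)

* The same rule applied afresh to the logarithms
quietly summarize log10price, detail
scalar lo_log = r(p25) - 1.5 * (r(p75) - r(p25))
scalar hi_log = r(p75) + 1.5 * (r(p75) - r(p25))
count if (log10price < lo_log | log10price > hi_log) & !missing(log10price)
```

If the two counts differ, some points have changed status between the scales.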

The same issue can affect, although usually not as much, the calculated median and quartiles. Each can be based on interpolation between data points, and so it is not always true that, say, the median of the logarithms is exactly the same as the logarithm of the median. Unless your dataset is strange and small, you would not usually be troubled by the difference, but only for the minimum and maximum is there never any problem at all: they are always actual data points, and the logarithm preserves order.
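A quick check of this point, again only as a sketch: the auto dataset, price, the new variable log10price, and the scalar names are all assumptions made for illustration.

```stata
* Assumption: the shipped auto dataset and price stand in for your own data
sysuse auto, clear
generate double log10price = log10(price)

* Summaries on the original scale, then logged
quietly summarize price, detail
scalar log_of_median = log10(r(p50))
scalar log_of_max    = log10(r(max))

* Summaries of the logarithms themselves
quietly summarize log10price, detail
scalar median_of_logs = r(p50)
scalar max_of_logs    = r(max)

display "log of median - median of logs: " log_of_median - median_of_logs
display "log of max - max of logs:       " log_of_max - max_of_logs
```

With an even number of observations the median is the average of the two middle values, so the first difference need not be zero. The second should be zero, because the maximum is always a data point.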

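The way to do it properly is thus to take logarithms first: copy the variable with **clonevar**, **replace** the copy with its base-10 logarithm, and then draw the box plot of the copy.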
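A minimal sketch of that workflow follows. The auto dataset, its variable price, and the new variable name log10price are assumptions chosen for illustration; substitute your own data and names.

```stata
* Assumption: the shipped auto dataset and price stand in for your own data
sysuse auto, clear

* clonevar copies the variable label (and other properties) along with the values
clonevar log10price = price

* replace promotes the storage type as needed to hold noninteger values
replace log10price = log10(price)

* Median, quartiles, and separately plotted points are now calculated on the log scale
graph box log10price
```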

The reason for starting with
**clonevar** is that
you pick up the variable label. **log10()** has the marginal advantage
over **log()** (or **ln()**) that you can calculate the inverse, the
power of 10, more easily. This method can prove useful when you want to add
more intelligible axis labels to the graph. Seeing 4 on the axis of the
resulting graph, you can think that it means 10^{4} or 10,000, leading to an extra
specification such as **ylabel(4 “10000”)**.

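However, here extra labels such as **3 “1000” 5
“100000”** are too far outside the range of the data. Few of
us can recall base-10 logarithms of anything but the integer powers of 10, but Stata can do the
calculation on the fly. To add labels at the equivalents of 5,000 and
15,000, put expressions such as **`=log10(5000)'** and **`=log10(15000)'** in the label specification; Stata evaluates each one before drawing the graph.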
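One way such a command might look, as a sketch only: the auto dataset, price, and log10price are assumed names, and the particular labels are just those mentioned in the text.

```stata
* Assumption: the shipped auto dataset and price stand in for your own data
sysuse auto, clear
clonevar log10price = price
replace log10price = log10(price)

* `=exp' is evaluated on the fly, so each label lands at the logarithm of the value shown
graph box log10price, ///
    ylabel(`=log10(5000)' "5000" 4 "10000" `=log10(15000)' "15000")
```

The positions are on the logarithmic scale and the text shown is on the original scale, so the reader sees familiar values.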

The help for this trick is at
**[P] macro**.
However, having to spell out several label specifications in this way is at
best a little tedious. To get the best of all worlds, you will want nice
axis labels on the original scale, but with Stata doing all the work that
you would rather not do. One way of getting those is through the program
**mylabels** from SSC.
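You just say what you want shown and specify the scale in use. For example,
after a call to **mylabels** that asks for labels at 3000, 5000, 7000, 9000, 12000, and 15000 on a base-10 logarithmic scale and names a local macro to hold the result,

Stata echoes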

3.47712 "3000" 3.69897 "5000" 3.8451 "7000" 3.95424 "9000" 4.07918 "12000" 4.17609 "15000"

You should not retype that, or even copy and paste, because it is tucked safely inside the local macro specified, ready to be referenced in the axis label option of the next **graph box** command.
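A sketch of the whole sequence follows. It assumes that **mylabels** can be installed from SSC on your machine, that the auto dataset and price stand in for your data, and that log10price and labels are acceptable names for the new variable and the local macro.

```stata
* Assumption: mylabels is not shipped with Stata; install it from SSC if it is absent
capture which mylabels
if _rc ssc install mylabels

* Assumption: the shipped auto dataset and price stand in for your own data
sysuse auto, clear
clonevar log10price = price
replace log10price = log10(price)

* @ is a placeholder for each value listed; the result is stored in local macro labels
mylabels 3000 5000 7000 9000 12000 15000, myscale(log10(@)) local(labels)

* Pass the stored specification straight to the axis label option
graph box log10price, ylabel(`labels')
```

Because local macros are visible only where they were defined, run the **mylabels** and **graph box** commands in the same do-file or the same interactive session.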

It may take a few iterations to get it right, but simply reissue
**mylabels** until you do. Use the same local macro name. Stata is happy
to overwrite it, as local macros are expendable.

The same issue with box plots and change of scale arises with any nonlinear transformation. The calculation of median and quartiles and the selection of data points for separate plotting need to be done afresh on any new scale. For example, psychologists and others work with times taken by test subjects to complete a task. The distributions are often highly skewed and subjects who do not complete should be assigned missing values. The reciprocal of time is a speed; missing times can be recoded as zero speeds. Here again you would need to do the transformation yourself and possibly fix the axis labels, too.
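The reciprocal case can be sketched in the same way. Everything here is hypothetical: the small dataset typed in below, the variable names time and speed, the units, and the choice of axis labels.

```stata
* Hypothetical data: completion times in seconds, missing for subjects who did not finish
clear
input time
12
15
20
35
80
.
.
end

* The reciprocal of time is a speed; it is missing wherever time is missing
generate double speed = 1 / time

* Recode subjects who did not complete as zero speed
replace speed = 0 if missing(time)

* Summaries are calculated on the speed scale; labels show the original times
graph box speed, ///
    ylabel(0 "not completed" `=1/80' "80" `=1/20' "20" `=1/12' "12") ///
    ytitle("Time (seconds), reciprocal scale")
```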

For broader discussion of box plots within Stata, including how to create your own variants on the default design, see Cox (2009, 2013).

- Cox, N. J. 2009. Speaking Stata: Creating and varying box plots. *Stata Journal* 9: 478–496.

- Cox, N. J. 2013. Speaking Stata: Creating and varying box plots: Correction. *Stata Journal* 13: 398–400.

- Frigge, M., D. C. Hoaglin, and B. Iglewicz. 1989. Some implementations of the box plot. *American Statistician* 43: 50–54.

- Tukey, J. W. 1977. *Exploratory Data Analysis.* Reading, MA: Addison–Wesley.