# st: Logarithmic scales and box plots

From: n j cox
To: statalist@hsphsun2.harvard.edu
Subject: st: Logarithmic scales and box plots
Date: Thu, 22 Sep 2005 12:12:23 +0100

Something I have seen locally seems to deserve a wider note.

Users may fire up a box plot in Stata, realise that a logarithmic scale
would be better for their variable, and then ask for that by
-ysc(log)- or -xsc(log)-.

Arguably,

1. Stata will let you do this, but in a sense it should not. Almost always, the result will be not quite what you want, or what you would
want if you were concentrating fully.

2. For private exploration, the difference may be of little
consequence, but even then it is possible to be puzzled or even misled.

3. Doing it properly, especially for public reports, is possible, and not too difficult.

The point at issue is probably too much of a confusing complication
for the elementary books, and too obvious or too trivial for smarter
writers to bother about, so it can fall between those two stools. If anyone knows of a discussion in print, I would appreciate a reference. (Incidentally, I have noticed an astonishing trend, that increasingly no
introductory statistics book is considered complete without colour photos of smiling people of different kinds, even if completely irrelevant to the material discussed!)

In what follows I assume that your variable is all positive, because
otherwise a logarithmic scale is not defined. As you may have
noticed, if -graph- is asked to -?sc(log)- when zero or negative values
are present, it just gives you a ridiculous graph, rather like the kind of teacher who will not say "That was a stupid comment", but just give you a funny look which clearly means, "Do think about that a bit more".
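A quick check before transforming can catch that problem early. This is only a minimal sketch using the auto data, not something -graph- does for you:

```stata
sysuse auto, clear

* how many values could not be shown on a logarithmic scale?
count if price <= 0

* stop with an error if any value is zero or negative
* (missing values pass this test, as Stata treats them as very large)
assert price > 0
```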

The main issue is one of division of labour. -ysc(log)- and -xsc(log)-
just take the graph you would have got otherwise and warp it logarithmically. However, what neither does is to re-calculate summaries on the log scale. In a sense, your punishment here is that you got what you asked for.

Recipes for box plots differ from book to book and program to program. Back in 1989 Frigge and friends catalogued several variants in _The
American Statistician_, and no doubt a careful trawl would reveal
some they missed and others that have arisen since. Stata follows what
John Tukey settled on after trying various possibilities, most importantly here that a data point is plotted separately if it lies more than 1.5 times the interquartile range away from the nearer quartile. If you re-do this on a logarithmic scale, you will almost always get a different answer whenever such points exist, and sometimes even if they do not. Some high values plotted separately may jump back inside the main box-and-whiskers cluster. Some low values may even jump out of that cluster and now be plotted separately. The re-classifications reflect the fact that the interquartile range of logarithms is not in general the logarithm of the interquartile range.
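The re-classification can be checked by hand. What follows is a rough sketch, written as do-file lines, which assumes that the quartiles from -summarize, detail- match those used by -graph box-; the scalar names and graph names are my own inventions.

```stata
sysuse auto, clear
gen log10price = log10(price)

* the 1.5 * IQR rule applied on the original scale
quietly summarize price, detail
scalar lo_raw = r(p25) - 1.5 * (r(p75) - r(p25))
scalar hi_raw = r(p75) + 1.5 * (r(p75) - r(p25))
count if (price < scalar(lo_raw) | price > scalar(hi_raw)) & !missing(price)

* the same rule applied to the logarithms
quietly summarize log10price, detail
scalar lo_log = r(p25) - 1.5 * (r(p75) - r(p25))
scalar hi_log = r(p75) + 1.5 * (r(p75) - r(p25))
count if (log10price < scalar(lo_log) | log10price > scalar(hi_log)) & !missing(log10price)

* the two graphs side by side, for comparison
graph box price, ysc(log) name(warped, replace)
graph box log10price, name(logged, replace)
```

If the two counts differ, so will the points plotted separately in the two graphs.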

The same issue can affect, although usually to a lesser extent, the
calculated median and quartiles. Each can be based on interpolation
between data points, and so it is not always true that (for example)
the median of the logarithms is _exactly_ the same as the logarithm of the median. Admittedly, unless your data set is very strange and very small, you would not usually be troubled by the difference, but only for the minimum and maximum is there absolutely no problem.
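A deliberately strange and small data set shows the interpolation effect. This is a toy example of my own, with four made-up values, so that the median falls halfway between the two middle values:

```stata
clear
input x
1
10
100
1000
end
gen log10x = log10(x)

* logarithm of the median: the median is (10 + 100) / 2 = 55
quietly summarize x, detail
display log10(r(p50))

* median of the logarithms: (1 + 2) / 2 = 1.5
quietly summarize log10x, detail
display r(p50)
```

The first result is about 1.74 and the second is 1.5, so the two do not agree.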

The way to do it properly is thus to take logarithms first, e.g.

. sysuse auto, clear
. gen log10price = log10(price)
. graph box log10price

Here -log10()- has the marginal advantage over -log()- (or -ln()-) that many users can do the inverse in their heads, thinking 4 means 10^4 = 10000, or whatever, so that you can add stuff like

. graph box log10price, yla(4 "10000")

but not many of us can remember more than the integer powers here.
A cute trick is to force Stata to do the calculation on the fly, as
in

. graph box log10price, yla(`=log10(5000)' "5000" 4 "10000" `=log10(15000)' "15000")

but if I show my colleagues that they seem to regard it as a bit of
a joke: it is admittedly tedious, even if you can remember and understand the syntax. (The relevant help here is at -help macro-.)

To get the best of all worlds, you will want several "nice" axis labels on the original scale, with Stata doing all the calculations. One way of getting those is through the program -mylabels- on SSC. With -mylabels- you just say what you want shown and what is the scale in use, and the program then does the harder bit. For example, if I say

. mylabels 3000(2000)7000 9000(3000)15000, myscale(log10(@)) local(labels)

Stata echoes

3.47712 "3000" 3.69897 "5000" 3.8451 "7000" 3.95424 "9000" 4.07918 "12000" 4.17609 "15000"

You should not retype that, or even copy and paste, because it
is tucked safe inside a local macro.

. graph box log10price, yla(`labels', ang(h))

I often find it takes a few goes to get it right, but all that
is needed is to reissue -mylabels- until you do. Use the same
local macro name. Stata is happy to overwrite it, as local macros
are totally expendable.
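If -mylabels- is not installed (-ssc install mylabels- will fetch it), much the same label list can be built by hand with a loop. This is a sketch of one alternative, not what -mylabels- does internally. As it relies on a local macro, the lines need to be run together if they are in a do-file.

```stata
sysuse auto, clear
gen log10price = log10(price)

* build pairs of position on the log10 scale and label on the original scale
local labels
foreach v of numlist 3000(2000)7000 9000(3000)15000 {
    local labels `labels' `=log10(`v')' "`v'"
}

graph box log10price, yla(`labels', ang(h))
```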

The same issue with box plots arises with any nonlinear
transformation, although logarithms are the most common in
practice and the most tempting in Stata given -?sc(log)-.
Thus psychologists and some others work with times taken by
rats or students to complete a task. The distributions are
often highly skew and those who do not complete a task should be
assigned missing values. The reciprocal of time is a speed: note that missing times can be recoded as zero speeds. Here again you would need
to do the transformation yourself and quite possibly fix the
axis labels too.
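The reciprocal case might be sketched as follows. The variable -time- and its values are entirely made up for illustration, and the label positions are just reciprocals of times I thought would look "nice".

```stata
clear
input time
12
15
20
30
60
.
.
end

* reciprocal of time is a speed; those who did not finish get speed zero
gen speed = 1 / time
replace speed = 0 if missing(time)

* box plot on the speed scale, labelled in terms of the original times
graph box speed, yla(0 "not completed" `=1/30' "30" `=1/20' "20" `=1/15' "15" `=1/12' "12", ang(h)) ytitle("time")
```

Note that the axis then runs from slow at the bottom to fast at the top, so the time labels decrease upwards.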

Nick
n.j.cox@durham.ac.uk