
From: n j cox <n.j.cox@durham.ac.uk>
To: statalist@hsphsun2.harvard.edu
Subject: st: Logarithmic scales and box plots
Date: Thu, 22 Sep 2005 12:12:23 +0100

Something I have seen locally seems to deserve a wider note.

Users may fire up a box plot in Stata, realise that a logarithmic scale would be better for their variable, and then ask for that by -ysc(log)- or -xsc(log)-.

Arguably,

1. Stata will let you do this, but in a sense it should not. Almost always, the result will be not quite what you want, or what you would want if you were concentrating fully.

2. For private exploration, the difference may be of little consequence, but even then it is possible to be puzzled or even misled.

3. Doing it properly, especially for public reports, is possible, and not too difficult.

The point at issue is probably too much of a confusing complication for the elementary books, and too obvious or too trivial for smarter writers to bother about, so it can fall between those two stools. If anyone knows of a discussion in print, I would appreciate a reference. (Incidentally, I have noticed an astonishing trend, that increasingly no introductory statistics book is considered complete without colour photos of smiling people of different kinds, even if completely irrelevant to the material discussed!)

In what follows I assume that your variable is all positive, because otherwise a logarithmic scale is not defined. As you may have noticed, if -graph- is asked to -?sc(log)- when zero or negative values are present, it just gives you a ridiculous graph, rather like the kind of teacher who will not say "That was a stupid comment", but just gives you a funny look which clearly means, "Do think about that a bit more".
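A quick check before transforming can catch the problem early. This is only a sketch, using -price- from the auto data that ships with Stata as a stand-in for your own variable:

. sysuse auto, clear

. count if price <= 0

. assert price > 0

-count- reports how many zero or negative values there are, and -assert- stops with an error if any nonmissing value is zero or negative.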

The main issue is one of division of labour. -ysc(log)- and -xsc(log)- just take the graph you would have got otherwise and warp it logarithmically. However, what neither does is to re-calculate summaries on the log scale. In a sense, your punishment here is that you got what you asked for.

Recipes for box plots differ from book to book and program to program. Back in 1989 Frigge and friends catalogued several variants in _The American Statistician_, and no doubt a careful trawl would reveal some they missed and others that have arisen since. Stata follows what John Tukey settled on after trying various possibilities, most importantly here that a data point is plotted separately if it lies more than 1.5 times the interquartile range away from the nearer quartile. If you re-do this on a logarithmic scale, you will almost always get a different answer whenever such points exist, and sometimes even if they do not. Some high values plotted separately may jump back inside the main box-and-whiskers cluster. Some low values may even jump out of that cluster and now be plotted separately. The re-classifications reflect the fact that the interquartile range of the logarithms is not in general the logarithm of the interquartile range.
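The re-classification can be checked by hand. What follows is only a sketch: it assumes that the quartiles returned by -summarize, detail- match those used by -graph box-, and -upper_raw- and -upper_log- are names invented here for the upper limit (upper quartile plus 1.5 times the interquartile range) on each scale.

. sysuse auto, clear

. gen double log10price = log10(price)

. quietly summarize price, detail

. scalar upper_raw = r(p75) + 1.5 * (r(p75) - r(p25))

. quietly summarize log10price, detail

. scalar upper_log = r(p75) + 1.5 * (r(p75) - r(p25))

. count if price > scalar(upper_raw) & !missing(price)

. count if log10price > scalar(upper_log) & !missing(log10price)

If the two counts differ, some points plotted separately on one scale are absorbed into the whiskers on the other. The lower limits can be compared in the same way.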

The same issue can affect, although usually to a lesser extent, the calculated median and quartiles. Each can be based on interpolation between data points, and so it is not always true that (for example) the median of the logarithms is _exactly_ the same as the logarithm of the median. Admittedly, unless your data set is very strange and very small, you would not usually be troubled by the difference, but only for the minimum and maximum is there absolutely no problem.
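A deliberately tiny, made-up example shows the mechanism. With an even number of values the median is the average of the two middle ones, and the logarithm of an average is not the average of the logarithms. The variable names -x- and -log10x- are invented for illustration, and note that -clear- drops whatever data are in memory.

. clear

. set obs 4

. gen double x = 10^(_n - 1)

. gen double log10x = log10(x)

. quietly summarize x, detail

. display log10(r(p50))

. quietly summarize log10x, detail

. display r(p50)

Here the values are 1, 10, 100 and 1000, so the median is 55 and its logarithm is about 1.74, whereas the median of the logarithms 0, 1, 2 and 3 is 1.5.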

The way to do it properly is thus to take logarithms first, e.g.

. sysuse auto, clear

. gen log10price = log10(price)

. graph box log10price

Here -log10()- has the marginal advantage over -log()- (or -ln()-) that many users can do the inverse in their heads, thinking 4 means 10^4 = 10000, or whatever, so that you can add stuff like

. graph box log10price, yla(4 "10000")

but not many of us can remember more than the integer powers here.

A cute trick is to force Stata to do the calculation on the fly, as in

. graph box log10price, yla(`=log10(5000)' "5000" 4 "10000" `=log10(15000)' "15000")

but if I show my colleagues that, they seem to regard it as a bit of a joke: it is admittedly tedious, even if you can remember and understand the syntax. (The relevant help here is at -help macro-.)

To get the best of all worlds, you will want several "nice" axis labels on the original scale, with Stata doing all the calculations. One way of getting those is through the program -mylabels- on SSC. With -mylabels- you just say what you want shown and what is the scale in use, and the program then does the harder bit. For example, if I say

. mylabels 3000(2000)7000 9000(3000)15000, myscale(log10(@)) local(labels)

Stata echoes

3.47712 "3000" 3.69897 "5000" 3.8451 "7000" 3.95424 "9000" 4.07918 "12000" 4.17609 "15000"

You should not retype that, or even copy and paste, because it is tucked safe inside a local macro.

. graph box log10price, yla(`labels', ang(h))

I often find it takes a few goes to get it right, but all that is needed is to reissue -mylabels- until you do. Use the same local macro name. Stata is happy to overwrite it, as local macros are totally expendable.
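If -mylabels- is not to hand, much the same list can be improvised with a loop and the on-the-fly trick. This is only a sketch of that idea, not a description of how -mylabels- itself works, and it assumes that -log10price- exists as above.

. local labels

. foreach v of numlist 3000(2000)7000 9000(3000)15000 {
  2.     local labels `labels' `=log10(`v')' "`v'"
  3. }

. graph box log10price, yla(`labels', ang(h))

The first line empties the macro, so that rerunning the loop does not keep appending to an old list.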

The same issue with box plots arises with any nonlinear transformation, although logarithms are the most common in practice and the most tempting in Stata given -?sc(log)-.

Thus psychologists and some others work with times taken by rats or students to complete a task. The distributions are often highly skew and those who do not complete a task should be assigned missing values. The reciprocal of time is a speed: note that missing times can be recoded as zero speeds. Here again you would need to do the transformation yourself and quite possibly fix the axis labels too.
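As a sketch of that case, suppose a hypothetical variable -time- holds completion times, with missing values for those who did not finish. The names -time- and -speed-, the label values and the axis title are all assumptions for illustration only.

. gen double speed = 1/time

. replace speed = 0 if missing(time)

. graph box speed, yla(0 "not done" `=1/20' "20" `=1/10' "10" `=1/5' "5", ang(h)) ytitle("time, reciprocal scale")

The box, whiskers and separately plotted points are then all calculated on the speed scale, while the labels still show times.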

Nick

n.j.cox@durham.ac.uk
