RE: st: box plot reference
"Nick Cox" <firstname.lastname@example.org>
Tue, 19 Aug 2008 17:46:12 +0100
I have a few footnotes to add.
I agree that Tukey's 1977 book (reference already given) is the best single reference for the box plot.
Tukey seems to have re-invented [more on that later] the box plot about 1970. He played with different choices before the 1977 book (and after, a story not important here). The key idiosyncratic detail, which Stata follows, is that data points are shown individually if they are more than 1.5 IQR away from the nearer quartile; otherwise thin lines are drawn covering the intervals between the quartiles and any data points within 1.5 IQR of the nearer quartile. According to Paul Velleman, Tukey adopted 1.5 because 1 is too small and 2 is too large. However, someone thought up a theoretical rationale for 1.5 quite recently in the American Statistician (I forget the details).
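The 1.5 IQR rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Stata's own code: the function name box_plot_summary is hypothetical, and the quartiles come from Python's statistics.quantiles with method="inclusive", which is just one of several quartile conventions and may not match the one Stata uses.

```python
from statistics import quantiles


def box_plot_summary(values, k=1.5):
    """Summarise data for a Tukey-style box plot (illustrative sketch).

    Assumption: quartiles use the "inclusive" interpolation rule;
    other software may use a different quartile convention.
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Limits at k * IQR beyond the nearer quartile (k = 1.5 by default).
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr

    # Points within the limits determine how far the thin lines extend;
    # points beyond the limits are shown individually.
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outside = [x for x in data if x < lower_limit or x > upper_limit]

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outside": outside,
    }


result = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(result)
```

With these ten values, 30 lies more than 1.5 IQR above the upper quartile and so would be plotted individually, while the upper thin line stops at 9, the largest value within the limit.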
That's the way I would describe the box plot. Tukey didn't use the terminology of quartiles, but his various alternatives (hinges, fourths, F-values) have all faded away into history despite some zealous advocacy by students and colleagues in the 1970s and 1980s. Similarly Tukey came up with alternative names for the IQR (midspread, fourth-spread, F-spread, I seem to recall) which just have not stood the test of time.
In that spirit I can't see a case for prolonging the life of "inner fences" or "adjacent values", except when no alternative is available. The box plot can be explained perfectly well without these bizarre terms. Tukey was, on his good days, an outstanding genius at inventing new terms -- software, bit, to start with -- but he had some dopey ideas too.
It is important to realise that there are many slightly different recipes for the box plot. A fairly early paper on this
Robert McGill, John W. Tukey, Wayne A. Larsen. 1978.
Variations of box plots.
American Statistician 32: 12-16.
could no doubt be supplemented by other examples, not least from
John W. Tukey. 1993.
Graphic comparisons of several linked aspects: alternatives and suggested principles.
Journal of Computational and Graphical Statistics 2: 1-33.
Both of these papers are accessible to subscribers on JSTOR.
Also, I said that Tukey re-invented the box plot. Many texts will tell you that Tukey invented the box plot, but they seem to copy each other and are at the very least wrong in large part. Tukey is responsible for the excellent name "box plot", the recipe being discussed here, and (through his enormously able and energetic advocacy) for its widespread use, so praise all round on those scores. But the idea of plotting a box for the IQR and adding informative detail about data in the tails beyond is decades older, e.g. under the name of dispersion diagram (a name as dull and dreary as a drizzly day in Durham) in geography and climatology. Yes, the geographers were there first. The Graphics manual and its predecessors have long given a reference from 1933.
Thanks Maarten and Elizabeth. Tukey works. For some reason, Elizabeth's reply wasn't sent to me.
Go for Tukey as was suggested by Elizabeth:
This is the classic reference and if your school is any good it should
> My school doesn't have Moore and McCabe, but I did look in Tanis and
> Hogg and there wasn't any mention of fences or adjacent values
> (at least when it is describing box plots). It does mention quartiles
> and medians, but nothing else. Any other ideas? I've looked through
> about 30 stats books so far.