Statalist The Stata Listserver



Re: RE: st: Definition of "outside" in box plots - new reference


From   n j cox <n.j.cox@durham.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: RE: st: Definition of "outside" in box plots - new reference
Date   Thu, 25 May 2006 14:42:07 +0100

I'm sympathetic to Allan's attitude here, but he can't
have it all ways. It appears that he follows Tukey in

(a) regarding box plots as general-purpose exploratory tools

and

(b) wanting box plots to use some criterion according
to which individual data points should be flagged.

There has to be a little tension between (a) and (b).
What criterion you should use for optimal performance will depend
on what distribution underlies the data. Those who
quite reasonably object that they don't know or don't want to
say are inevitably left with a criterion that they must regard
as arbitrary or of unknown applicability to their data, other than
some hunch that it does or doesn't work well in practice. Allan
seems to have painted himself into this corner.

I am very happy to believe that people can improve on quartiles
+/- 1.5 iqr for specific circumstances. I am not happy with
the implication, if such it is, that I must use different
designs for different circumstances, because as others have
pointed out earlier, there is enough lack of standardisation
in box plot design already.
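
For concreteness, the "quartiles +/- 1.5 iqr" rule can be sketched in a few lines. This is only an illustration in Python, not what Stata or any other package does internally: the function names, the multiplier argument k and the sample data are all made up for the example, and the quartiles come from one of several possible definitions (the "inclusive" method of Python's statistics.quantiles), so other software may place the fences slightly differently.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: lower quartile - k*IQR, upper quartile + k*IQR.

    Assumption: quartiles use the "inclusive" interpolation method of
    statistics.quantiles; other quartile definitions give other fences.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outside(values, k=1.5):
    """Return the values lying strictly outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical batch of data, invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(tukey_fences(data))   # fences for this batch
print(flag_outside(data))   # values flagged for a closer look
```

Changing k changes which points are flagged, which is the sense in which the criterion is a convention and not a test.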

Of course, it is a big jump between

(*) these data points deserve a bit more consideration (can
you check them against the original records? do we need
to work with a transformed scale? etc.)

and

(**) these data points can be formally and objectively
declared outliers.

I'd regard (**) as a subversion of what Tukey (and also
the geographers who were using box plots decades before him)
was trying to do. But it's clear that many people find
(**) seductive, even if the objectivity is someone else's
rule of thumb encoded in someone else's software.

Interestingly, Allan's main argument in his opinion piece
in _Significance_ is that box plots are under-used. If anything,
I'd regard them as -- in the literature I see -- now over-used.
Here are some of the reasons for saying that:

1. Box plots seem most useful in comparing lots of batches
and/or variables, in circumstances in which you need to
see the wood for the trees. Lots here can mean, say, 20 or
30. They are quite often shown, however, for two or three
batches and/or variables, but for those comparisons box plots
usually throw too much of the information away. You could show
much more detail with some gain and no loss.

2. The box plot vastly overemphasises the contrast between
the box itself and what's outside it. There is nothing so
magic about quartiles that they deserve so much prominence.
The contrast is insidious, because
too many students and researchers think that they can "read"
box plots, but miss ways in which the box plot misleads. My
favourite example (following one from Howard Wainer) asks
for a verbal interpretation of

|----| * |----|

This is often "read", quite confidently, as representing
a unimodal symmetric distribution with short tails. But
that interpretation forgets an immediate consequence of
the key principle: if half the values are inside the
box, then half also are outside it. Or in terms of densities,
the densities in the tails exceed those in the middle and
the best guess from this box plot -- if nothing else were
known -- is a U-shaped distribution. (Preferring to
see another graph gets high marks, too.)

3. It may be that -- at least in some fields -- box plots
are the most commonly used means of identifying and even
of rejecting outliers. It is difficult to generalise but
to the extent that this is true it probably is worsening
the quality of data analyses. The notion that some data
are just _bad_ and should be thrown out is not crazy,
but it is dangerous. In any case, thinking of outlier
identification as a univariate problem is a drastic
oversimplification.
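
The arithmetic behind point 2 can be checked directly. The sketch below, in Python, assumes a box plot with no points plotted outside the whiskers, so that the whiskers end at the minimum and maximum and each of the four segments (whisker, half-box, half-box, whisker) holds a quarter of the data. The function name and the five-number summary are hypothetical, chosen so that each whisker is shorter than either half of the box.

```python
def implied_densities(minimum, q1, median, q3, maximum):
    """Average density over each quarter of the data.

    Each segment between successive cut points holds 25% of the values,
    so its average density is 0.25 divided by its width.
    Assumption: all five numbers are distinct (no zero-width segment).
    """
    cuts = [minimum, q1, median, q3, maximum]
    return [0.25 / (hi - lo) for lo, hi in zip(cuts, cuts[1:])]


# Hypothetical summary: whiskers of width 1, half-boxes of width 2.
densities = implied_densities(0, 1, 3, 5, 6)
print(densities)  # order: left whisker, left half-box, right half-box, right whisker
```

With these numbers the average density in each tail is twice that in each half of the box, which is the U shape described in point 2.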

None of these points is original, and the common
use of (e.g.) dot, quantile or other distribution plots
and smoothed density traces shows the numerous
alternatives. Also, many are easier to explain than
box plots. (Those who think that box plots are intuitive
should volunteer to teach the introductory course the
next time round.) In Stata, check out

dot-like displays: -dotplot- (official), -stripplot- (SSC), -beamplot- (SSC)

quantile and other distribution plots: -quantile- (official), -qplot-
(SJ), -distplot- (SJ)

density traces: -kdensity- (official)

Nick
n.j.cox@durham.ac.uk

Allan Reese

"Outlier Labeling With Boxplot Procedures"
C. H. SIM, F. F. GAN, and T. C. CHANG.
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 100 (470): 642-652 JUN 2005

"... We recommend that the graphical
boxplot be constructed based on the knowledge of the underlying
distribution of the dataset and by controlling the risk of labeling regular
observations as outliers."
------------

I raised this last year with StataCorp and got a positive reply.
The messages are appended to help Jens' researches.
(R A Reese. Toolkit: boxplots. Significance Vol 2 (2005) issue 3 134-135.)

Sim and friends seem to have lost the plot, in that B&Ws are a
visual way to examine data, not a significance test. If you *know*
what the distribution is, why plot the data? Real data never
actually come from these neat, exact, distributions, so we need a
flag to direct attention to results that merit further investigation.

