Statalist The Stata Listserver



RE: st: RE getting rid of the outliners


From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: RE getting rid of the outliners
Date: Mon, 1 May 2006 23:55:56 +0100

There are several implementations of box plots, 
but Stata's follows the definition that 
outliers are more than (3/2) IQR from the nearer 
quartile. This rule of thumb comes from John 
W. Tukey, who named the box plot (but did not, 
contrary to many reports, really invent it). 
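
As a minimal sketch of that rule (in Python rather than Stata, purely
for illustration), the fences and the values beyond them could be
computed as below. The function names are hypothetical, and the
quartiles use Python's default convention, which is an assumption here
and need not match Stata's percentile definition exactly.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outside_values(values, k=1.5):
    """Return the values lying strictly beyond either fence."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(tukey_fences(data))
print(outside_values(data))
```

With the toy data above, only the value 100 falls outside the fences.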

It's well documented that Tukey -- despite
having been involved with computing since the
1940s and having coined the terms "software" and
"bit", not to mention smaller points like the
FFT -- developed his rule of thumb out of 
experience in drawing box plots by hand. (N.B.!) 

The datasets he was dealing with in that way 
were, it seems, typically much smaller than 1000 
observations, so in a way the rule goes in a circle: 
in samples of that size it flags about as many outliers 
as you might want to plot separately, and think about. 
 
An equally simple point, but one worth underlining 
briefly, is that Tukey made very heavy use of 
transformations to approximate symmetry, especially 
logarithms. 
Those not in the habit of transforming first, 
or of transforming at all, would on the whole 
see more outliers flagged than he would have done. 
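
A rough illustration of that point, assuming a lognormal sample as a
stand-in for right-skewed data. The seed, the sample size and the
helper name are arbitrary choices for this sketch, and the quartile
convention is Python's default rather than Stata's.

```python
import math
import random
import statistics

def count_outside(values, k=1.5):
    """Count values more than k * IQR beyond the nearer quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return sum(v < q1 - k * iqr or v > q3 + k * iqr for v in values)

rng = random.Random(2006)  # fixed seed so the run is repeatable
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(1000)]
logged = [math.log(v) for v in raw]  # the transformed-first version

flagged_raw = count_outside(raw)
flagged_log = count_outside(logged)
print("flagged on raw scale:", flagged_raw)
print("flagged on log scale:", flagged_log)
```

On the raw scale the long right tail pushes many points past the upper
fence. After taking logs the sample is roughly symmetric and far fewer
points are flagged.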

Nick 
n.j.cox@durham.ac.uk 

Michael Blasnik
 
> I too have thought that the standard box plot fences flag too many
> values as outliers.  Maybe it's because I often work with fairly large
> N, or because I work with messy real world data, but I find so many
> values outside the fences that the criterion has no meaning.  Based on
> the standard definition, you should expect about 22 "outliers" in a
> sample of 1,000 when the sample is perfectly Gaussian.  In my
> experience, 5%-10% outliers are even more common with real data.
> 
> When I want to investigate outliers, in addition to using graphs and
> model diagnostics (e.g., df-betas), I often define "fences" at 3 iqr
> above and below the median.  That threshold, which should result in
> 0.3 outliers per 1,000 Gaussian observations, tends to give me a more
> manageable list of "severe" outliers to investigate.
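
As a rough check on figures of this kind, the population-level
expectation for an exactly Gaussian variable can be sketched as
follows. This assumes known population quartiles rather than quartiles
estimated from the sample, and the function name is hypothetical.
Small samples tend to flag a somewhat higher share than this, so the
results are a benchmark only and need not agree with the numbers
quoted above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()          # mean 0, standard deviation 1
Q3 = STD_NORMAL.inv_cdf(0.75)      # upper quartile; lower quartile is -Q3
IQR = 2 * Q3                       # about 1.349 standard deviations

def expected_outside_per_1000(k, from_median=False):
    """Expected count per 1,000 Gaussian values beyond fences at k * IQR.

    Fences are measured from the quartiles by default, or from the
    median when from_median is True.
    """
    centre = 0.0 if from_median else Q3
    upper_fence = centre + k * IQR
    tail = 1.0 - STD_NORMAL.cdf(upper_fence)
    return 1000 * 2 * tail         # two symmetric tails

per_1000_standard = expected_outside_per_1000(1.5)
per_1000_median_3iqr = expected_outside_per_1000(3, from_median=True)
print(round(per_1000_standard, 2))
print(round(per_1000_median_3iqr, 3))
```

Under these assumptions the standard fences flag roughly 7 values per
1,000, and fences at 3 IQR either side of the median flag well under
1 per 1,000.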
