Ronnie Babigumira <rb.glists@gmail.com> wrote:
Subject: Re: st: RE getting rid of the outliners
<snip> That said, I have a follow-up question for you.
Using the fences created by
local u = r(p75) + (3/2) * (r(p75) - r(p25))
local l = r(p25) - (3/2) * (r(p75) - r(p25))
would capture "mild" outliers. So my question is: how does this sit with
the discussion in, for example, Hamilton, Statistics with Stata, which
distinguishes between mild and severe outliers, pointing out that it is
severe outliers that create problems for many statistical techniques?
I too have thought that the standard box plot fences flag too many values as
outliers. Maybe it's because I often work with fairly large N, or because I
work with messy real-world data, but I find so many values outside the
fences that the criterion has no meaning. Based on the standard definition,
you should expect about 7 "outliers" in a sample of 1,000 even when the
sample is perfectly Gaussian, because the fences sit at roughly 2.7 standard
deviations from the mean. In my experience, rates of 5%-10% are common with
real data.
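
As a rough check on the Gaussian expectation, the fence position can be
expressed in standard-deviation units and the tail area beyond it computed
directly. This is only a sketch: it uses Stata's built-in invnormal() and
normal() functions, the local macro names are my own, and it works from
population quartiles, so it ignores sampling variation in the estimated
quartiles.

```stata
* Sketch: expected share of a Gaussian population outside the standard fences
* (macro names q3, iqr, fence, share are illustrative only)
* Upper quartile and interquartile range of a standard normal
local q3    = invnormal(0.75)
local iqr   = 2 * `q3'
* Upper fence in SD units: Q3 + 1.5 * IQR (the lower fence is symmetric)
local fence = `q3' + 1.5 * `iqr'
* Probability mass beyond both fences
local share = 2 * normal(-`fence')
display "fence at " %5.3f `fence' " SD; expected per 1,000: " %5.2f (1000*`share')
```

Changing the multiplier 1.5 shows how quickly the expected count falls as
the fences move out.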
When I want to investigate outliers, in addition to using graphs and model
diagnostics (e.g., DFBETAs), I often define "fences" at 3 IQR above and
below the median. That threshold, which sits about 4 standard deviations
from the mean and should result in only about 0.05 outliers per 1,000
Gaussian observations, tends to give me a more manageable list of "severe"
outliers to investigate.
Michael Blasnik
michael.blasnik@verizon.net