Statalist The Stata Listserver



Re: st: RE getting rid of the outliners


From: "Michael Blasnik" <michael.blasnik@verizon.net>
To: <statalist@hsphsun2.harvard.edu>
Subject: Re: st: RE getting rid of the outliners
Date: Mon, 01 May 2006 08:35:41 -0400

Ronnie Babigumira <rb.glists@gmail.com> wrote:
Subject: Re: st: RE getting rid of the outliners
<snip> That said, I have a follow up question for you

Using the fences created by

local u = r(p75) + (3/2) * (r(p75) - r(p25))
local l = r(p25) - (3/2) * (r(p75) - r(p25))

would capture "mild" outliers. So my question is: how does this sit with the discussion in, for example, Hamilton, Statistics with Stata, which distinguishes between mild and severe outliers and points out that it is severe outliers that create problems for many statistical techniques?

I too have thought that the standard box plot fences flag too many values as outliers. Maybe it's because I often work with fairly large N, or because I work with messy real-world data, but I find so many values outside the fences that the criterion has no meaning. Based on the standard definition (1.5 IQR beyond the quartiles), you should expect about 7 "outliers" in a sample of 1,000 even when the sample is perfectly Gaussian. In my experience, rates of 5%-10% are common with real data.
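
As a rough check on figures like this, here is a minimal sketch that computes the share of an exactly normal population falling outside the standard box plot fences. It assumes the usual definition (1.5 IQR beyond the quartiles) and uses population quantiles, so finite-sample counts may run somewhat higher. The scalar names are hypothetical.

```stata
* Sketch only: population tail area under an exact normal distribution.
* z75 is the 75th percentile of the standard normal, so the IQR is 2*z75.
scalar z75   = invnormal(0.75)
scalar iqr_z = 2 * z75

* Upper fence at Q3 + 1.5*IQR; by symmetry the two-sided tail area is
* twice the area below the matching lower fence.
scalar p_std = 2 * normal(-(z75 + 1.5 * iqr_z))

display "Expected flags per 1,000 (1.5 IQR beyond quartiles): " %5.2f (1000 * p_std)
```

Changing the multiplier 1.5, or replacing z75 with 0 to measure from the median instead of the quartiles, gives the corresponding rate for other fence definitions.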

When I want to investigate outliers, in addition to using graphs and model diagnostics (e.g., df-betas), I often define "fences" at 3 IQR above and below the median. That threshold, which should result in roughly 0.05 outliers per 1,000 Gaussian observations, tends to give me a more manageable list of "severe" outliers to investigate.
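
The wider fences described above could be set up along these lines. This is a sketch under assumptions, not the exact code used: the auto dataset and the variable price are stand-ins for your own data, and the names iqr, lo, hi, and severe are hypothetical.

```stata
* Sketch only: auto.dta and price are illustrative stand-ins.
sysuse auto, clear
quietly summarize price, detail
local iqr = r(p75) - r(p25)

* Wider fences: 3 IQR above and below the median
local lo = r(p50) - 3 * `iqr'
local hi = r(p50) + 3 * `iqr'

* Flag nonmissing values outside the fences. The missing() guard matters
* because Stata treats missing values as larger than any number.
generate byte severe = (price < `lo' | price > `hi') if !missing(price)

* Review the flagged observations rather than dropping them automatically
count if severe == 1
list make price if severe == 1
```

Keeping the flag as a variable, instead of dropping observations outright, leaves room to compare results with and without the flagged cases.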

Michael Blasnik
michael.blasnik@verizon.net



