Statalist The Stata Listserver



st: RE: RE getting rid of the outliners


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: RE getting rid of the outliners
Date   Mon, 1 May 2006 22:59:49 +0100

First off, the definition of outlier [not "outliner"] 
implemented in box plots is just one of several. Also, even 
in principle, getting rid of outliers on the basis of 
univariate calculations might miss many that would 
be regarded as bivariate or multivariate outliers, 
as contemplation of possible configurations on
scatter plots and their kin should make clear, to 
mention only one detail. More generally, the notion
that outliers are necessarily bad is far from the truth. 
They are often genuine extremes and highly informative
about what is going on -- or not going on -- in 
a dataset. 
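
To illustrate the point about bivariate outliers, here is a minimal
sketch with made-up data. The variables x and y and their values are
invented purely for illustration. One point sits well off the line
y = x, yet neither variable on its own has any value outside the usual
box plot fences, so I would expect both counts to be zero while the
scatter plot shows the odd point plainly.

```stata
* made-up data: 20 points on the line y = x, plus one point at (5, 15)
clear
set obs 21
generate x = _n
generate y = _n
replace x = 5 in 21
replace y = 15 in 21

* univariate box plot rule, applied to each variable in turn
foreach v of varlist x y {
    quietly summarize `v', detail
    local lo = r(p25) - 1.5 * (r(p75) - r(p25))
    local hi = r(p75) + 1.5 * (r(p75) - r(p25))
    count if (`v' < `lo' | `v' > `hi') & !missing(`v')
}

* the bivariate view: the point at (5, 15) stands apart from the rest
scatter y x
```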

The purpose of my program -adjacent- is just to _list_ 
adjacent values, as its help makes plain, the idea being
that looking at them and thinking about them might be helpful. 
No more, no less. 

There are in addition -egen- functions in -egenmore- from
SSC for putting adjacent values into variables. 
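
A minimal sketch of how that might look, assuming that the relevant
functions are -adjl()- and -adju()- (names as I recall them; check
-help egenmore- after installing). The auto data and the variable
names lower and upper are arbitrary choices for illustration.

```stata
* install -egenmore- from SSC (needs an internet connection)
ssc install egenmore

sysuse auto, clear

* assumed function names: adjl() = lower adjacent value, adju() = upper
egen lower = adjl(mpg)
egen upper = adju(mpg)

* look at the values outside the adjacent values; nothing is dropped
list make mpg lower upper if (mpg < lower | mpg > upper) & !missing(mpg)
```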

I did not consider that people might want to 
drop all values beyond the adjacent values. Now that this has been 
mentioned, let me say that I have absolutely 
no intention of modifying any program I write 
to support what I regard as an extraordinarily bad idea. 

What next? Why not drop all points more than 
a smidgen away from regression planes? R^2 values will
go up, supervisors will be happier, more papers
will be publishable -- and the cause of science
will have been harmed severely. That's still
a 3-1 goal difference. 

Nick 
n.j.cox@durham.ac.uk 

Maarten Buis
 
> -findit adjacent value- brings up Nick's module
> -adjacent- which you can install. It will only show
> you the adjacent values, it does not store them so
> you can use them to drop outliers. That could be an
> oversight on the part of Nick, but I would not be
> surprised if it was deliberate to prevent people
> from mechanically dropping outliers.
> 
> Underneath I show how to create a new variable that
> is one when mpg is an outlier and zero when it is
> not, and how that variable could be used without
> dropping cases. For details have a look at:
> http://www.stata.com/support/faqs/data/trueorfalse.html
> 
> 
> *----------------begin example-----------------
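> sysuse auto, clear
> summarize mpg, detail
> * fences at 1.5 times the interquartile range beyond the quartiles
> local u = r(p75) + (3/2) * (r(p75) - r(p25))
> local l = r(p25) - (3/2) * (r(p75) - r(p25))
> * 1 if outside the fences, 0 if inside, missing if mpg is missing
> gen byte out = (mpg < `l' | mpg > `u') if !missing(mpg)
> hist mpg          /* histogram including outliers */
> hist mpg if !out  /* histogram excluding outliers */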
> *---------------end example---------------------
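
The same quartile arithmetic can also report the adjacent values
themselves, that is, the most extreme observed values still inside
the fences. What follows is only a sketch under the same assumptions
as the example above (auto data, fences at 1.5 times the
interquartile range); the local macro names are arbitrary. It lists
the points outside the fences for inspection rather than dropping
them.

```stata
sysuse auto, clear
quietly summarize mpg, detail
local iqr     = r(p75) - r(p25)
local lofence = r(p25) - 1.5 * `iqr'
local hifence = r(p75) + 1.5 * `iqr'

* adjacent values: smallest and largest observations inside the fences
quietly summarize mpg if mpg >= `lofence' & mpg <= `hifence'
display "lower adjacent value: " r(min)
display "upper adjacent value: " r(max)

* inspect, rather than drop, whatever lies outside the fences
list make mpg if (mpg < `lofence' | mpg > `hifence') & !missing(mpg)
```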
 
vora n 

> Is there any STATA command that can drop
> the observations that are the outliners?
> 
> Let's say I graph the box-and-whisker plot
> 
> graph box y
> 
> and then the graph will show the outliners.
> Is there any built-in command that can identify
> these outliners and drop them out of my data?
> 
> Or is there any command that tells the upper
> adjacent value and the lower adjacent value
> so that I can drop the outliners manually?



