st: Treatment of outliers


From   "Allan Reese (Cefas)" <allan.reese@cefas.co.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Treatment of outliers
Date   Tue, 7 Jun 2011 11:00:06 +0100

The exchanges following a request to *trim* variables (technically
distinct from identifying and removing outliers) prompt me to post a
comment I bottled up at the time Peter Diggle's paper was read at the
RSS.  As it's geostatistics, Nick may have a view.

http://www.math.ntnu.no/~hrue/r-inla.org/case-studies/Diggle09/DiggleSept09.pdf
(an odd ref, but one that Google found and it works today) has
the title "Geostatistical inference under preferential sampling".  Since
the premise was that data were collected with prejudice, and the point
of the data and the modelling was to identify locations with high Pb
contamination, it seemed to me very odd that the paper includes a
throwaway comment "The measured lead concentrations included two gross
outliers in 2000, each of which we replaced by the average of the
remaining values from that year's survey."

In principle, I agree with Nick (gosh, that's a phrase gone out of
fashion) that outliers in real data need very careful consideration.
One of the major problems in the use of statistical methods is that
people apply textbook methods without noting the assumptions underlying
the data generation. (So, doctor, can we assume all your patients are
independent, identical and exchangeable from a single normal
distribution?)

A simple test of the robustness of a model is to compare the fit
with/without the use of suspected outliers.  If the fit is substantially
the same, you can use the results.  If including the outliers
substantially changes the model, you are forced to make a judgment
(non-probabilistic) on the source of the data.
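
As a minimal sketch of the with/without comparison described above, the
following Python fits a straight line by ordinary least squares twice: once
on all points and once with the suspected outliers dropped. The data, the
function names and the 10% tolerance on the slope are all invented for
illustration; what counts as "substantially the same" is a judgment for the
analyst, not something the code can settle.

```python
# Hypothetical sketch: compare a least-squares line fitted with and
# without suspected outliers. Data and tolerance are illustrative only.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def compare_fits(xs, ys, suspect_idx, rel_tol=0.10):
    """Fit with all points and with the suspects dropped.

    Returns both fits and a flag that is True when the slope changes by
    more than rel_tol relative to the fit without the suspects
    (assumes that slope is not zero).
    """
    suspects = set(suspect_idx)
    kept = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in suspects]
    full = fit_line(xs, ys)
    reduced = fit_line([x for x, _ in kept], [y for _, y in kept])
    change = abs(full[1] - reduced[1]) / abs(reduced[1])
    return full, reduced, change > rel_tol

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 40.0]  # last value is the suspect
full, reduced, differs = compare_fits(xs, ys, suspect_idx=[7])
print("all data:         intercept=%.2f slope=%.2f" % full)
print("suspects dropped: intercept=%.2f slope=%.2f" % reduced)
print("substantially different:", differs)
```

If the flag comes back False, the two fits agree to within the chosen
tolerance; if True, the decision about the suspect points has to rest on
what is known about how the data were generated.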

I also note the original posting mentioned, "I have 150000 observations
and out of these observations I want to delete 25 observations from the
upper and lower boundaries."  
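
For concreteness, here is a minimal Python sketch of trimming, assuming the
quoted request means dropping the 25 smallest and the 25 largest values of a
single variable (the wording could be read other ways). The function name
and the toy data are hypothetical. On that reading, 50 of 150000
observations, roughly 0.03% of the data, would be removed.

```python
# Hypothetical sketch: drop the k smallest and k largest values.

def trim_extremes(values, k):
    """Return the values sorted, with the k smallest and k largest removed."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value")
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]

data = list(range(1, 101))        # toy data: 1..100
trimmed = trim_extremes(data, 5)  # drop 5 from each end
print(len(trimmed), trimmed[0], trimmed[-1])
```

Note that trimming removes a fixed number of extreme values regardless of
whether they are erroneous, which is why it is technically distinct from
identifying and removing outliers.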

Allan 

R Allan Reese
Senior statistician, Cefas
The Nothe, Weymouth DT4 8UB 
Tel: +44 (0)1305 20 6614 -direct
Fax: +44 (0)1305 20 6601 
www.cefas.defra.gov.uk 