



Re: st: Elimination of outliers


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Elimination of outliers
Date   Mon, 6 Jun 2011 23:29:32 +0100

I don't know that paper.

Continuing, however, with an atmospheric science theme: the Antarctic
ozone hole is a big example of a genuine outlier. The story that it
was missed for a long time because a program automatically rejected
outliers is, however, a myth:
<http://www.math.uni-augsburg.de/stochastik/pukelsheim/1990c.pdf>

The widespread circulation of the story in the environmental science
literature, on the other hand, reflects an attitude that rejection of
outliers often throws out the baby!

Nick

On Mon, Jun 6, 2011 at 10:11 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Nick--
> I don't pretend to know much about environmental data; but I did a
> quick introspection, followed by a quick google on air quality
> sensors, and found
> Engel-Cox, J.A. and Holloman, C.H. and Coutant, B.W. and Hoff, R.M. 2004.
> "Qualitative and quantitative evaluation of MODIS satellite sensor
> data for regional and urban scale air quality."
> Atmospheric Environment, 38(16):2495--2509.
> which seems to indicate a lot of concern about extreme readings on air
> quality; if one were regressing some y on air quality and other
> explanatory variables, it seems reasonable to drop the extreme
> measurements.  Better to get auxiliary measurements and run IV for
> measurement error; best to get error-free measurements; but in the
> event that one must proceed with error-prone data, dropping the
> extreme data on explanatory variables can often be a reasonable step.
>
> On Mon, Jun 6, 2011 at 4:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> Thanks for the clarification.
>>
>> On your last question, I think that usually makes no physical sense
>> for environmental data where I have most experience. I am straining to
>> imagine that it is anything other than horribly ad hoc in any
>> application.
>>
>> On dummies for outliers: better than dropping them; good if there is
>> some independent rationale.
>>
>> One definition of an outlier is that it surprises the analyst, and the
>> best outcome is to think of a model in which the surprise disappears.
>> Working on a logarithmic scale is so far as I can see the best trick,
>> if not the oldest. (Thucydides recorded the use of the mode as a
>> robust estimator, although not quite in those words, about 2400 years
>> ago.)
>>
>> Nick
>>
>> On Mon, Jun 6, 2011 at 9:35 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>>> Nick--
>>> The simulation is contrived to illustrate one and only one point:
>>> trimming data based on values of X that are suspect is fine, but
>>> trimming data based on values of y that are suspect is dangerous at
>>> best and nearly always ill-advised.  This is a point I have made many
>>> times on the list, sometimes in the context of replying to folks who
>>> want to take the log of zero. Note I have made no mention of model
>>> residuals; that is a different kind of outlier detection with its own
>>> issues.  The poster asked about trimming data based on the variables'
>>> values alone, and my point was that this is not a bad idea a priori as
>>> long as you only do it to RHS (explanatory) variables and not LHS
>>> (outcome) variables.  I think Jeff and Richard are thinking in terms
>>> of model outliers, perhaps in terms of leverage or such.  Your Amazon
>>> example could fall in any of these categories, but including an Amazon
>>> dummy is no different in practice from dropping the Amazon data point,
>>> right?  Or did you have in mind allowing for nonlinearities?  It makes
>>> sense in many cases to fit a best linear approximation to a subset of
>>> the data and then to look at the outlying data with a less linear
>>> model, no?
>>>
>>> On Mon, Jun 6, 2011 at 4:24 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>> I don't think what happens in contrived simulations hits the main
>>>> methodological issue at all. As a geographer, some of the time, an
>>>> outlier to me is something like the Amazon which is big and different
>>>> and something that needs to be accommodated in the model.  That can be
>>>> done in many ways other than by discarding outliers. Once throwing
>>>> away awkward data is regarded as legitimate, when do you stop?
>>>> (Independent evidence that an outlier is untrustworthy, as in lab
>>>> records of experiments, is a different thing, although even there
>>>> there are well-known stories of discarding as a matter of prior
>>>> prejudice.)
>>>>
>>>> To make the question as stark as possible, and to suppress large areas
>>>> of grey (gray): There are people who fit the data to the model and
>>>> people who fit models to the data. It may sound like the same thing,
>>>> but the attitude that one is so confident that the model is right that
>>>> you are happy to discard the most inconvenient data is not at all the
>>>> same as the attitude that the data can tell you something about the
>>>> inadequacies of the current model.
>>>>
>>>> Nick
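
The question raised in the thread, whether a dummy for a single unusual
observation differs in practice from dropping that observation, can be
checked numerically. What follows is a minimal sketch in Python (not
Stata, and not code from any poster in this thread). The data, the seed,
the "Amazon" point at x = 60, and the helper ols() are all assumptions
made up for illustration.

```python
import random

def ols(rows, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(2011)  # arbitrary seed, for reproducibility only
n = 50
x = [rng.uniform(0, 10) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
# One hypothetical "Amazon": big and different from everything else.
x.append(60.0)
y.append(20.0)

# (a) Keep every observation; add a dummy equal to 1 only for the last one.
with_dummy = ols([[1.0, xi, 1.0 if i == n else 0.0] for i, xi in enumerate(x)], y)
# (b) Drop the last observation and fit without the dummy.
dropped = ols([[1.0, xi] for xi in x[:n]], y[:n])

print("intercept, slope with dummy:", with_dummy[:2])
print("intercept, slope after drop:", dropped)
```

In this sketch the intercept and slope agree up to rounding error, because
the dummy absorbs the odd observation completely. The dummy's own
coefficient reports how far that observation lies from the line fitted to
the rest, which is information that dropping discards.
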
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
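
The claim debated above, that trimming on an explanatory variable is
comparatively harmless while trimming on the outcome is dangerous, can be
illustrated with a small simulation. This is a hedged sketch in Python,
not the simulation Austin Nichols refers to (which is not reproduced in
this message). The model y = 1 + 2x + e, the 5%/95% cutoffs, the sample
sizes, and the helpers slope() and trim() are all assumptions chosen for
illustration.

```python
import random

def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def trim(x, y, key, lo=0.05, hi=0.95):
    """Keep (x, y) pairs whose key value lies between the lo and hi empirical quantiles."""
    ranked = sorted(key)
    a, b = ranked[int(lo * len(key))], ranked[int(hi * len(key)) - 1]
    kept = [(xi, yi) for xi, yi, ki in zip(x, y, key) if a <= ki <= b]
    return [p[0] for p in kept], [p[1] for p in kept]

rng = random.Random(42)  # arbitrary seed, for reproducibility only
true_slope, reps, n = 2.0, 200, 500
sums = {"all": 0.0, "trim_x": 0.0, "trim_y": 0.0}
for _ in range(reps):
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [1.0 + true_slope * xi + rng.gauss(0, 2) for xi in x]
    sums["all"] += slope(x, y)                    # no trimming
    sums["trim_x"] += slope(*trim(x, y, key=x))   # trim on the regressor
    sums["trim_y"] += slope(*trim(x, y, key=y))   # trim on the outcome

# Average estimated slope over all replications, per trimming rule.
means = {k: v / reps for k, v in sums.items()}
print(means)
```

Under these assumptions the average slope stays close to 2 with no
trimming and with trimming on x, while trimming on y pulls it noticeably
below 2. Selecting on the outcome truncates the error distribution in a
way that depends on x, and that is what biases the slope. None of this
speaks to Nick Cox's objection: in the simulation the discarded x values
carry no special meaning, whereas a real extreme value may be exactly
what the model most needs to accommodate.
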


