Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

Re: st: Elimination of outliers

From   Austin Nichols <>
Subject   Re: st: Elimination of outliers
Date   Mon, 6 Jun 2011 17:11:27 -0400

I don't pretend to know much about environmental data; but I did a
quick introspection, followed by a quick google on air quality
sensors, and found
Engel-Cox, J.A. and Holloman, C.H. and Coutant, B.W. and Hoff, R.M. 2004.
"Qualitative and quantitative evaluation of MODIS satellite sensor
data for regional and urban scale air quality."
Atmospheric Environment, 38(16):2495--2509.
which seems to indicate a lot of concern about extreme readings on air
quality; if one were regressing some y on air quality and other
explanatory variables, it seems reasonable to drop the extreme
measurements.  Better to get auxiliary measurements and run IV for
measurement error; best to get error-free measurements; but in the
event that one must proceed with error-prone data, dropping the
extreme data on explanatory variables can often be a reasonable step.
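
A minimal sketch of the "auxiliary measurements and IV" suggestion above, assuming classical (independent, mean-zero) measurement error. The data-generating process, the variable names (x_true, x1, x2) and the helper functions are hypothetical and chosen only for illustration; nothing here comes from the cited paper. A second noisy measurement of the same quantity serves as the instrument for the first.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

def ols_slope(x, y):
    """Slope from a simple regression of y on x."""
    return cov(x, y) / cov(x, x)

def iv_slope(z, x, y):
    """Simple IV (Wald) estimator: instrument z for regressor x."""
    return cov(z, y) / cov(z, x)

rng = random.Random(12345)  # fixed seed so the run is reproducible
n = 20000
true_slope = 2.0            # assumed value, for illustration only

# Unobserved true regressor and the outcome it drives
x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * xt + rng.gauss(0, 1) for xt in x_true]

# Two independent error-prone measurements of the same quantity
x1 = [xt + rng.gauss(0, 1) for xt in x_true]
x2 = [xt + rng.gauss(0, 1) for xt in x_true]

b_ols = ols_slope(x1, y)     # attenuated toward zero by measurement error
b_iv = iv_slope(x2, x1, y)   # second measurement used as instrument

print("OLS on noisy x1:", round(b_ols, 3))
print("IV using x2:    ", round(b_iv, 3))
```

With equal variances for the signal and the measurement noise, the OLS slope should come out near half the true slope, while the IV estimate should be close to the true slope. The sketch covers classical error only; wild sensor readings of the kind discussed above are a different problem.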

On Mon, Jun 6, 2011 at 4:59 PM, Nick Cox <> wrote:
> Thanks for the clarification.
> On your last question, I think that usually makes no physical sense
> for environmental data where I have most experience. I am straining to
> imagine that it is anything other than horribly ad hoc in any
> application.
> On dummies for outliers: better than dropping them; good if there is
> some independent rationale.
> One definition of an outlier is that it surprises the analyst, and the
> best outcome is to think of a model in which the surprise disappears.
> Working on a logarithmic scale is so far as I can see the best trick,
> if not the oldest. (Thucydides recorded the use of the mode as a
> robust estimator, although not quite in those words, about 2400 years
> ago.)
> Nick
> On Mon, Jun 6, 2011 at 9:35 PM, Austin Nichols <> wrote:
>> Nick--
>> The simulation is contrived to illustrate one and only one point:
>> trimming data based on values of X that are suspect is fine, but
>> trimming data based on values of y that are suspect is dangerous at
>> best and nearly always ill-advised.  This is a point I have made many
>> times on the list, sometimes in the context of replying to folks who
>> want to take the log of zero. Note I have made no mention of model
>> residuals; that is a different kind of outlier detection with its own
>> issues.  The poster asked about trimming data based on the variables'
>> values alone, and my point was that this is not a bad idea a priori as
>> long as you only do it to RHS (explanatory) variables and not LHS
>> (outcome) variables.  I think Jeff and Richard are thinking in terms
>> of model outliers, perhaps in terms of leverage or such.  Your Amazon
>> example could fall in any of these categories, but including an Amazon
>> dummy is no different in practice from dropping the Amazon data point,
>> right?  Or did you have in mind allowing for nonlinearities?  It makes
>> sense in many cases to fit a best linear approximation to a subset of
>> the data and then to look at the outlying data with a less linear
>> model, no?
>> On Mon, Jun 6, 2011 at 4:24 PM, Nick Cox <> wrote:
>>> I don't think what happens in contrived simulations hits the main
>>> methodological issue at all. As a geographer, some of the time, an
>>> outlier to me is something like the Amazon which is big and different
>>> and something that needs to be accommodated in the model.  That can be
>>> done in many ways other than by discarding outliers. Once throwing
>>> away awkward data is regarded as legitimate, when do you stop?
>>> (Independent evidence that an outlier is untrustworthy, as in lab
>>> records of experiments, is a different thing, although even there
>>> there are well-known stories of discarding as a matter of prior
>>> prejudice.)
>>> To make the question as stark as possible, and to suppress large areas
>>> of grey (gray): There are people who fit the data to the model and
>>> people who fit models to the data. It may sound like the same thing,
>>> but the attitude that one is so confident that the model is right that
>>> you are happy to discard the most inconvenient data is not at all the
>>> same as the attitude that the data can tell you something about the
>>> inadequacies of the current model.
>>> Nick
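
The simulation referred to in the thread is not reproduced in these messages. As a rough stand-in, the following sketch illustrates the one point at issue: trimming on the values of X leaves the regression slope roughly intact, while trimming on the values of y biases it. The linear model, the 10th and 90th percentile cutoffs, and all names are assumptions made for illustration, not the original poster's code.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def keep_middle(key, lo=0.10, hi=0.90):
    """Boolean mask keeping observations between two empirical quantiles of key."""
    s = sorted(key)
    low, high = s[int(lo * len(s))], s[int(hi * len(s)) - 1]
    return [low <= v <= high for v in key]

rng = random.Random(2011)  # fixed seed so the run is reproducible
n = 20000

# Assumed model: y = 1 + 2*x + e, with x and e standard normal
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

def slope_on_subset(mask):
    """Re-estimate the slope using only the observations flagged in mask."""
    xs = [a for a, m in zip(x, mask) if m]
    ys = [b for b, m in zip(y, mask) if m]
    return ols_slope(xs, ys)

b_full = ols_slope(x, y)
b_trim_x = slope_on_subset(keep_middle(x))  # trim on the explanatory variable
b_trim_y = slope_on_subset(keep_middle(y))  # trim on the outcome

print("full sample:", round(b_full, 3))
print("trimmed on x:", round(b_trim_x, 3))
print("trimmed on y:", round(b_trim_y, 3))
```

Under these assumptions the full-sample and X-trimmed slopes should both land near 2, while the y-trimmed slope is pulled noticeably toward zero, because selecting on y selects on the error term. This says nothing about whether the trimmed observations are substantively interesting, which is the objection raised in the thread.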
