



Re: st: Elimination of outliers


From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Elimination of outliers
Date: Mon, 6 Jun 2011 21:59:00 +0100

Thanks for the clarification.

On your last question, I think that usually makes no physical sense
for environmental data, where I have the most experience. I am straining
to imagine that it is anything other than horribly ad hoc in any
application.

On dummies for outliers: better than dropping them; good if there is
some independent rationale.
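
As a side note on the arithmetic: in ordinary least squares, a dummy that
equals 1 for a single observation fits that observation exactly, so the
remaining coefficients match those obtained by dropping it. What the dummy
adds is that the point stays in the data and its departure from the fit is
reported as a coefficient. The sketch below is only an illustration; the
data and the helper `ols` are made up for this example and come from
nobody in the thread.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data: five ordinary points plus one big, different one.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 50.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
last = len(x) - 1

# (a) Drop the outlying observation.
b_drop = ols([[1.0, xi] for xi in x[:last]], y[:last])

# (b) Keep it, but add a dummy equal to 1 for that observation only.
b_dummy = ols([[1.0, xi, 1.0 if i == last else 0.0]
               for i, xi in enumerate(x)], y)

print("dropped:    intercept, slope =", b_drop)
print("with dummy: intercept, slope =", b_dummy[:2])
print("dummy coefficient =", b_dummy[2])
```

The intercept and slope agree between (a) and (b) up to rounding, and the
dummy coefficient is the gap between the outlying y and what the line
fitted to the other points predicts for it.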

One definition of an outlier is that it surprises the analyst, and the
best outcome is to think of a model in which the surprise disappears.
Working on a logarithmic scale is, so far as I can see, the best trick,
if not the oldest. (Thucydides recorded the use of the mode as a
robust estimator, although not quite in those words, about 2400 years
ago.)

Nick

On Mon, Jun 6, 2011 at 9:35 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Nick--
> The simulation is contrived to illustrate one and only one point:
> trimming data based on values of X that are suspect is fine, but
> trimming data based on values of y that are suspect is dangerous at
> best and nearly always ill-advised.  This is a point I have made many
> times on the list, sometimes in the context of replying to folks who
> want to take the log of zero. Note I have made no mention of model
> residuals; that is a different kind of outlier detection with its own
> issues.  The poster asked about trimming data based on the variables'
> values alone, and my point was that this is not a bad idea a priori as
> long as you only do it to RHS (explanatory) variables and not LHS
> (outcome) variables.  I think Jeff and Richard are thinking in terms
> of model outliers, perhaps in terms of leverage or such.  Your Amazon
> example could fall in any of these categories, but including an Amazon
> dummy is no different in practice from dropping the Amazon data point,
> right?  Or did you have in mind allowing for nonlinearities?  It makes
> sense in many cases to fit a best linear approximation to a subset of
> the data and then to look at the outlying data with a less linear
> model, no?
>
> On Mon, Jun 6, 2011 at 4:24 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> I don't think what happens in contrived simulations hits the main
>> methodological issue at all. As a geographer, some of the time, an
>> outlier to me is something like the Amazon which is big and different
>> and something that needs to be accommodated in the model.  That can be
>> done in many ways other than by discarding outliers. Once throwing
>> away awkward data is regarded as legitimate, when do you stop?
>> (Independent evidence that an outlier is untrustworthy, as in lab
>> records of experiments, is a different thing, although even there
>> there are well-known stories of discarding as a matter of prior
>> prejudice.)
>>
>> To make the question as stark as possible, and to suppress large areas
>> of grey (gray): There are people who fit the data to the model and
>> people who fit models to the data. It may sound like the same thing,
>> but the attitude that one is so confident that the model is right that
>> you are happy to discard the most inconvenient data is not at all the
>> same as the attitude that the data can tell you something about the
>> inadequacies of the current model.
>>
>> Nick
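
For readers without the earlier messages to hand, the point about trimming
on X versus trimming on y can be sketched with a small simulation. This is
not the simulation discussed in the thread, only a hypothetical
illustration: the linear model with true slope 1, the normal errors, the
cutoff of 1.5, and the helper `slope` are all assumptions made for this
example.

```python
import random

def slope(pairs):
    """OLS slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

random.seed(12345)

# Assumed data-generating process: y = x + noise, so the true slope is 1.
data = []
for _ in range(20000):
    x = random.gauss(0.0, 1.0)
    y = x + random.gauss(0.0, 1.0)
    data.append((x, y))

cut = 1.5  # arbitrary cutoff chosen for illustration

slope_all = slope(data)
# Trim on the explanatory variable only.
slope_trim_x = slope([(x, y) for x, y in data if abs(x) < cut])
# Trim on the outcome variable only.
slope_trim_y = slope([(x, y) for x, y in data if abs(y) < cut])

print("all data:     ", round(slope_all, 3))
print("trimmed on x: ", round(slope_trim_x, 3))
print("trimmed on y: ", round(slope_trim_y, 3))
```

Under these assumptions, selecting on x leaves the slope estimate close to
the true value, because the errors are unrelated to the selection rule.
Selecting on y removes points partly because of their errors, which pulls
the estimated slope toward zero. None of this bears on the separate
question of whether discarding data is good practice in the first place.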


