



Re: st: Elimination of outliers


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Elimination of outliers
Date   Mon, 6 Jun 2011 16:35:25 -0400

Nick--
The simulation is contrived to illustrate one and only one point:
trimming data based on values of X that are suspect is fine, but
trimming data based on values of y that are suspect is dangerous at
best and nearly always ill-advised.  This is a point I have made many
times on the list, sometimes in the context of replying to folks who
want to take the log of zero. Note I have made no mention of model
residuals; that is a different kind of outlier detection with its own
issues.  The poster asked about trimming data based on the variables'
values alone, and my point was that this is not a bad idea a priori as
long as you only do it to RHS (explanatory) variables and not LHS
(outcome) variables.  I think Jeff and Richard are thinking in terms
of model outliers, perhaps in terms of leverage or such.  Your Amazon
example could fall in any of these categories, but including an Amazon
dummy is no different in practice from dropping the Amazon data point,
right?  Or did you have in mind allowing for nonlinearities?  It makes
sense in many cases to fit a best linear approximation to a subset of
the data and then to look at the outlying data with a less linear
model, no?
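
The contrast between trimming on X and trimming on y can be sketched with a small simulation. This is not the simulation posted earlier in the thread; it is a minimal illustration in Python, and the data-generating process (y = x + e with standard normal x and e), the cutoff of 1.5, the sample size, and the names ols_slope and simulate are all assumptions made for the example.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, beta=1.0, cutoff=1.5, seed=12345):
    """Compare slope estimates on the full sample, after trimming on x,
    and after trimming on y. All parameter values are illustrative."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [beta * x + rng.gauss(0, 1) for x in xs]

    # Trim on the explanatory variable: keep rows with |x| <= cutoff.
    kept_on_x = [(x, y) for x, y in zip(xs, ys) if abs(x) <= cutoff]
    # Trim on the outcome variable: keep rows with |y| <= cutoff.
    kept_on_y = [(x, y) for x, y in zip(xs, ys) if abs(y) <= cutoff]

    return {
        "full": ols_slope(xs, ys),
        "trim_x": ols_slope(*zip(*kept_on_x)),
        "trim_y": ols_slope(*zip(*kept_on_y)),
    }


if __name__ == "__main__":
    for label, slope in simulate().items():
        print(f"{label:7s} slope = {slope:.3f}")
```

Under these assumptions, selecting on x leaves the conditional mean of y given x untouched, so the slope estimate should stay close to the true value of 1, while selecting on y truncates the error distribution in a way that depends on x and pulls the estimated slope toward zero.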

On Mon, Jun 6, 2011 at 4:24 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> I don't think what happens in contrived simulations hits the main
> methodological issue at all. As a geographer, some of the time, an
> outlier to me is something like the Amazon which is big and different
> and something that needs to be accommodated in the model.  That can be
> done in many ways other than by discarding outliers. Once throwing
> away awkward data is regarded as legitimate, when do you stop?
> (Independent evidence that an outlier is untrustworthy, as in lab
> records of experiments, is a different thing, although even there
> there are well-known stories of discarding as a matter of prior
> prejudice.)
>
> To make the question as stark as possible, and to suppress large areas
> of grey (gray): There are people who fit the data to the model and
> people who fit models to the data. It may sound like the same thing,
> but the attitude that one is so confident that the model is right that
> you are happy to discard the most inconvenient data is not at all the
> same as the attitude that the data can tell you something about the
> inadequacies of the current model.
>
> Nick


