Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Austin Nichols <austinnichols@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Elimination of outliers
Date: Mon, 6 Jun 2011 17:11:27 -0400

Nick--
I don't pretend to know much about environmental data; but I did a
quick introspection, followed by a quick google on air quality
sensors, and found

Engel-Cox, J.A. and Holloman, C.H. and Coutant, B.W. and Hoff, R.M.
2004. "Qualitative and quantitative evaluation of MODIS satellite
sensor data for regional and urban scale air quality."
Atmospheric Environment, 38(16):2495--2509.

which seems to indicate a lot of concern about extreme readings on
air quality. If one were regressing some y on air quality and other
explanatory variables, it seems reasonable to drop the extreme
measurements. Better to get auxiliary measurements and run IV for
measurement error; best to get error-free measurements; but in the
event that one must proceed with error-prone data, dropping the
extreme data on explanatory variables can often be a reasonable step.

On Mon, Jun 6, 2011 at 4:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> Thanks for the clarification.
>
> On your last question, I think that usually makes no physical sense
> for environmental data where I have most experience. I am straining to
> imagine that it is anything other than horribly ad hoc in any
> application.
>
> On dummies for outliers: better than dropping them; good if there is
> some independent rationale.
>
> One definition of an outlier is that it surprises the analyst, and the
> best outcome is to think of a model in which the surprise disappears.
> Working on a logarithmic scale is so far as I can see the best trick,
> if not the oldest. (Thucydides recorded the use of the mode as a
> robust estimator, although not quite in those words, about 2400 years
> ago.)
>
> Nick
>
> On Mon, Jun 6, 2011 at 9:35 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>> Nick--
>> The simulation is contrived to illustrate one and only one point:
>> trimming data based on values of X that are suspect is fine, but
>> trimming data based on values of y that are suspect is dangerous at
>> best and nearly always ill-advised. This is a point I have made many
>> times on the list, sometimes in the context of replying to folks who
>> want to take the log of zero. Note I have made no mention of model
>> residuals; that is a different kind of outlier detection with its own
>> issues. The poster asked about trimming data based on the variables'
>> values alone, and my point was that this is not a bad idea a priori as
>> long as you only do it to RHS (explanatory) variables and not LHS
>> (outcome) variables. I think Jeff and Richard are thinking in terms
>> of model outliers, perhaps in terms of leverage or such. Your Amazon
>> example could fall in any of these categories, but including an Amazon
>> dummy is no different in practice from dropping the Amazon data point,
>> right? Or did you have in mind allowing for nonlinearities? It makes
>> sense in many cases to fit a best linear approximation to a subset of
>> the data and then to look at the outlying data with a less linear
>> model, no?
>>
>> On Mon, Jun 6, 2011 at 4:24 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> I don't think what happens in contrived simulations hits the main
>>> methodological issue at all. As a geographer, some of the time, an
>>> outlier to me is something like the Amazon which is big and different
>>> and something that needs to be accommodated in the model. That can be
>>> done in many ways other than by discarding outliers. Once throwing
>>> away awkward data is regarded as legitimate, when do you stop?
>>> (Independent evidence that an outlier is untrustworthy, as in lab
>>> records of experiments, is a different thing, although even there
>>> there are well-known stories of discarding as a matter of prior
>>> prejudice.)
>>>
>>> To make the question as stark as possible, and to suppress large areas
>>> of grey (gray): There are people who fit the data to the model and
>>> people who fit models to the data. It may sound like the same thing,
>>> but the attitude that one is so confident that the model is right that
>>> you are happy to discard the most inconvenient data is not at all the
>>> same as the attitude that the data can tell you something about the
>>> inadequacies of the current model.
>>>
>>> Nick
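
The distinction drawn in this message, between trimming on an explanatory variable and trimming on the outcome, can be illustrated with a small simulation. The sketch below is not the simulation referred to in the thread (that code is not reproduced in this message), and it is written in Python rather than Stata. The function names, the cutoff of 1.5, the sample size, and the data-generating process y = beta*x + e with standard normal x and e are all assumptions chosen for illustration.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, beta=1.0, cutoff=1.5, seed=12345):
    """Compare slopes on the full sample, an x-trimmed sample, and a y-trimmed sample.

    Hypothetical setup: x ~ N(0, 1), e ~ N(0, 1), y = beta * x + e.
    'Trimming' here means dropping observations with |value| > cutoff.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [beta * x + rng.gauss(0.0, 1.0) for x in xs]

    # Keep only observations whose x is not extreme (selection on the RHS variable).
    trim_x = [(x, y) for x, y in zip(xs, ys) if abs(x) <= cutoff]
    # Keep only observations whose y is not extreme (selection on the LHS variable).
    trim_y = [(x, y) for x, y in zip(xs, ys) if abs(y) <= cutoff]

    return {
        "full": ols_slope(xs, ys),
        "trim_x": ols_slope([x for x, _ in trim_x], [y for _, y in trim_x]),
        "trim_y": ols_slope([x for x, _ in trim_y], [y for _, y in trim_y]),
    }


if __name__ == "__main__":
    for label, slope in simulate().items():
        print(f"{label:>7}: estimated slope = {slope:.3f}")
```

Under these assumptions, dropping observations by their x values leaves the conditional mean of y given x unchanged, so the slope estimate should stay near beta, at the cost of some precision. Dropping by y values truncates the error distribution in a way that depends on x, which pulls the estimated slope toward zero.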

