

From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: xtreg check for outliers
Date: Thu, 9 Aug 2012 14:48:47 +0100

Somewhat in the spirit of David's comment, but from a broader perspective:

I think it's important to recognise that the one word "outliers" covers some quite different situations, some of which are not even problems. Indeed, one definition of outliers is that they surprise the researcher, so being an outlier is as much psychology as ontology.

A complete taxonomy is necessarily elusive, and there is at least one lengthy monograph on outliers. But minimally we should distinguish

1. Outliers that are essentially mistakes, as they represent impossible or at least implausible values. These can arise from equipment malfunction, contamination of samples, human misunderstanding, lies, careless recording of data, clashes in convention, inconsistencies in measurement units, etc. Thus -999 for age is evidently a missing data code, if not a joke by data entry people. If people are still on the lookout for such outliers when doing their modelling, it is a sign that they don't know enough or are not zealous enough about data management, including data quality checking. Sometimes there is scope for re-measurement, sometimes a rough value can be estimated in other ways, but often such values just have to be excluded from the data being analysed.

2. Outliers that are genuine and require care in handling, but that can be accommodated by using an appropriate transformed scale for analysis. For me as a geographer, the canonical example is the Amazon, which on most river measures really is big! Perhaps I am lucky, but it has been my experience that most such outliers can be accommodated by either transformation or using a suitable link function, either explicitly (e.g. -glm-) or tacitly (e.g. -poisson-). Logarithms are your friend.

3. Outliers that are genuine but seem to be awkward for or destructive to any model fit tried, and which the analyst is tempted to exclude from the data or to model ad hoc. A weak or inexperienced analyst yields to the temptation; a strong analyst knows several ways of including the outlier with various tricks, including devising new models. To me, the best rationale for exclusion is a substantive or scientific argument making it clear why the outlier really doesn't belong (it's a goat that doesn't belong with these sheep); excluding outliers just because they make life statistically difficult is less convincing.

Naturally, much more could be said. A purely personal aside is that I don't think that nonparametric statistics or robust statistics are quite as helpful in practice in dealing with outliers as their most energetic advocates would have you believe.

Nick

On Wed, Aug 8, 2012 at 1:37 PM, David Hoaglin <dchoaglin@gmail.com> wrote:
> Dalhia,
>
> In multiple regression, "outliers" can take a variety of forms.
>
> An observation may have an unusual combination of values of the
> predictor variables. Such points are influential. If the model fits
> well there, the corresponding value of y may not be an outlier.
> Cook's distance, DFFITS, and DFBETAS help to diagnose various aspects
> of influence.
>
> Studentized residuals can show whether the model fits poorly at an
> individual observation (in effect, whether that value of y is an
> outlier, relative to the model).
>
> The variety of possibilities can make diagnosis of "outliers" challenging.

On Wed, Aug 8, 2012 at 7:03 AM, Dalhia <ggs_da@yahoo.com> wrote:
>> How do I check for outliers when using xtreg, fe? One
>> solution I thought of was to demean each variable for each panel, and
>> then rerun using regress, and then use the cook's d, dfits, avplot etc.
>> to identify outliers. Is this a reasonable solution? Is there a
>> different/better way to do this?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
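For readers who want to see the mechanics of the approach Dalhia describes (demean each variable within its panel, refit by ordinary least squares, then look at Cook's distance), here is a minimal sketch. It is written in plain Python rather than Stata, uses a single predictor, and runs on made-up data; the function names, the toy data, and the planted outlier are all assumptions for illustration, not anything posted in the thread. One further assumption is labelled in the code: the residual variance uses N minus the number of panels minus 1 as degrees of freedom, to allow for the panel means that were removed. The leverages are those of the demeaned regression only, not of a model with explicit panel dummies.

```python
from collections import defaultdict


def demean_by_panel(ids, values):
    """Subtract each panel's own mean from its observations (within transformation)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(ids, values):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for g, v in zip(ids, values)]


def within_diagnostics(ids, x, y):
    """Slope, leverage and Cook's distance for y on x after panel demeaning."""
    xd = demean_by_panel(ids, x)
    yd = demean_by_panel(ids, y)
    sxx = sum(a * a for a in xd)
    # OLS through the origin: demeaned data have mean zero within each panel.
    beta = sum(a * b for a, b in zip(xd, yd)) / sxx
    resid = [b - beta * a for a, b in zip(xd, yd)]
    n, n_panels, k = len(y), len(set(ids)), 1
    # Assumption: degrees of freedom are reduced by the number of panel means removed.
    s2 = sum(e * e for e in resid) / (n - n_panels - k)
    leverage = [a * a / sxx for a in xd]
    cooks = [e * e / (k * s2) * h / (1 - h) ** 2 for e, h in zip(resid, leverage)]
    return beta, leverage, cooks


# Hypothetical toy panel: 3 panels of 4 observations, true slope 2,
# panel-specific intercepts, small deterministic noise.
ids = [1] * 4 + [2] * 4 + [3] * 4
x = [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 5, 7]
effects = {1: 10.0, 2: 20.0, 3: 30.0}
noise = [0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1, 0.1, -0.1, -0.1, 0.1]
y = [effects[g] + 2.0 * xi + e for g, xi, e in zip(ids, x, noise)]
y[11] += 5.0  # plant one gross outlier at a high-leverage point

beta, leverage, cooks = within_diagnostics(ids, x, y)
worst = max(range(len(cooks)), key=cooks.__getitem__)
print("slope:", beta, "largest Cook's distance at index:", worst)
```

In this toy example the planted observation has the largest Cook's distance, and it also drags the estimated slope away from 2. Whether such a point is a mistake, a candidate for a transformed scale, or a genuine awkward case is the substantive question Nick raises above; the diagnostic itself cannot answer it.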

**Follow-Ups**:
- **Re: st: xtreg check for outliers** *From:* Richard Goldstein <richgold@ix.netcom.com>

**References**:
- **st: One Step and 2 step GMM** *From:* muneer <mkmmuneerbabu@gmail.com>
- **st: xtreg check for outliers** *From:* Dalhia <ggs_da@yahoo.com>
- **Re: st: xtreg check for outliers** *From:* David Hoaglin <dchoaglin@gmail.com>
