Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

Re: st: xtreg check for outliers

From: Nick Cox
To: Richard Goldstein
Subject: Re: st: xtreg check for outliers
Date: Thu, 9 Aug 2012 15:11:06 +0100

As usual, I think my view doesn't differ from Richard's.

Surprise always implies a context in which surprise is expressed and
the current model, in the widest sense of ideas about the data
generating process, is the statistical person's context. Most humour
depends on this point, although less prosaically. Also compare

Bill is my friend. He is 7 feet tall.

Bill is a basketball player. He is 7 feet tall.

The context makes a lot of difference here. In the first, one's
reaction is "That's tall!". In the second, "Sure".

On Thu, Aug 9, 2012 at 3:01 PM, Richard Goldstein wrote:
> my view is a little different
> an outlier is a surprising value; it is surprising because one is
> comparing it, sometimes implicitly, to a model -- once you determine
> that the value is not an error, you need to consider whether you are
> using the "right" model -- changing the model will often change which
> values, if any, are "outliers"
> Rich
> On 8/9/12 9:48 AM, Nick Cox wrote:
>> Somewhat in the spirit of David's comment, but from a broader perspective:
>> I think it's important to recognise that the one word "outliers"
>> covers some quite different situations, some of which are not even
>> problems. Indeed, one definition of outliers is that they surprise the
>> researcher, so being an outlier is as much psychology as ontology.
>> A complete taxonomy is necessarily elusive, and there is at least one
>> lengthy monograph on outliers. But minimally we should distinguish
>> 1. Outliers that are essentially mistakes, as they represent
>> impossible or at least implausible values. These can arise from
>> equipment malfunction, contamination of samples, human
>> misunderstanding, lies, careless recording of data, clashes in
>> convention, inconsistencies in measurement units, etc. Thus -999 for
>> age is evidently a missing data code, if not a joke by data entry
>> people. If people are still on the lookout for such outliers when
>> doing their modelling, it is a sign that they don't know enough or are
>> not zealous enough about data management, including data quality
>> checking. Sometimes there is scope for re-measurement, sometimes a
>> rough value can be estimated in other ways, but often such values just
>> have to be excluded from the data being analysed.
>> 2. Outliers that are genuine, require care in handling but can be
>> accommodated by using an appropriate transformed scale for analysis.
>> The canonical example for me, as a geographer, is the Amazon, which on
>> most river measures really is big! Perhaps I am lucky but it has been
>> my experience that most such outliers can be accommodated by either
>> transformation or using a suitable link function, either explicitly
>> (e.g. -glm-) or tacitly (e.g. -poisson-). Logarithms are your friend.
>> 3. Outliers that are genuine but seem to be awkward for or destructive
>> to any model fit tried and which the analyst is tempted to exclude
>> from the data, or model ad hoc. A weak or inexperienced analyst yields
>> to the temptation; a strong analyst knows several ways of including
>> the outlier with various tricks, including devising new models. To me,
>> the best rationale for exclusion is a substantive or scientific
>> argument making it clear why the outlier really doesn't belong (it's a
>> goat that doesn't belong with these sheep) and excluding outliers just
>> because they make life statistically difficult is less convincing.
>> Naturally, much more could be said. A purely personal aside is that I
>> don't think that nonparametric statistics or robust statistics are
>> quite as helpful in practice in dealing with outliers as their most
>> energetic advocates would have you believe.
>> Nick
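Nick's second category, genuine outliers that a transformed scale can accommodate, can be illustrated with a small sketch. It is written in Python rather than Stata, and the river discharge figures are made up for illustration; they are not real measurements. The helper name max_z is hypothetical. The sketch compares how far the largest value sits from the mean, in standard deviation units, on the raw scale and on the log scale.

```python
import math
import statistics

def max_z(values):
    """Distance of the largest value from the mean, in sample SD units."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return (max(values) - mean) / sd

# Made-up discharge values; the last one plays the role of the Amazon.
discharge = [120, 300, 450, 800, 1500, 2500, 4000, 7000, 16000, 200000]

z_raw = max_z(discharge)
z_log = max_z([math.log10(v) for v in discharge])

print(f"largest value on raw scale: {z_raw:.2f} SDs above the mean")
print(f"largest value on log scale: {z_log:.2f} SDs above the mean")
```

On the log scale the largest value is still the largest, but it is less extreme relative to the spread of the rest, which is the sense in which the transformation accommodates it rather than removing it.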
>> On Wed, Aug 8, 2012 at 1:37 PM, David Hoaglin wrote:
>>> Dalhia,
>>> In multiple regression, "outliers" can take a variety of forms.
>>> An observation may have an unusual combination of values of the
>>> predictor variables.  Such points have high leverage and may be influential.  If the model fits
>>> well there, the corresponding value of y may not be an outlier.
>>> Cook's distance, DFFITS, and DFBETAS help to diagnose various aspects
>>> of influence.
>>> Studentized residuals can show whether the model fits poorly at an
>>> individual observation (in effect, whether that value of y is an
>>> outlier, relative to the model).
>>> The variety of possibilities can make diagnosis of "outliers" challenging.
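David's distinction between an unusual predictor value and a poorly fitted response can be made concrete. What follows is a minimal sketch in Python (not Stata), assuming a straight-line fit with a single predictor and made-up data; the function name ols_diagnostics and the numbers are illustrative only. It computes leverage, externally studentized residuals, Cook's distance, and DFFITS from the usual closed-form expressions.

```python
import math

def ols_diagnostics(x, y):
    """Case diagnostics for the straight-line fit y = a + b*x (p = 2)."""
    n, p = len(x), 2
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    rss = sum(e * e for e in resid)
    s2 = rss / (n - p)                      # residual variance, all cases
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx  # leverage
        # residual variance with this case deleted
        s2_del = (rss - e * e / (1 - h)) / (n - p - 1)
        rstudent = e / math.sqrt(s2_del * (1 - h))
        cooksd = e * e * h / (p * s2 * (1 - h) ** 2)
        dffits = rstudent * math.sqrt(h / (1 - h))
        rows.append({"leverage": h, "rstudent": rstudent,
                     "cooksd": cooksd, "dffits": dffits})
    return rows

# Made-up data: roughly y = 1 + 2x, with y pushed up at x = 3
# and one point far out in x (x = 20) that lies near the same line.
x = [1, 2, 3, 4, 5, 6, 20]
y = [3.1, 4.9, 12.0, 9.1, 10.9, 13.1, 41.0]

diag = ols_diagnostics(x, y)
for xi, row in zip(x, diag):
    print(xi, {k: round(v, 3) for k, v in row.items()})
```

In this made-up data the point at x = 20 has by far the largest leverage yet sits close to the fitted line, while the point at x = 3 has modest leverage and the largest studentized residual.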
>>> On Wed, Aug 8, 2012 at 7:03 AM, Dalhia wrote:
>>>> How do I check for outliers when using xtreg, fe? One
>>>> solution I thought of was to demean each variable for each panel, and
>>>> then rerun using regress, and then use the cook's d, dfits, avplot etc.
>>>> to identify outliers. Is this a reasonable solution? Is there a
>>>> different/better way to do this?
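The demeaning step Dalhia describes can be sketched as follows. This is a minimal illustration in Python (not Stata), assuming one predictor and a small made-up balanced panel; the names demean_by_panel, ids, x, and y are hypothetical. One caution, stated as a general property rather than a recommendation from the thread: a regression on demeaned data does not know that the panel means were estimated, so its residual degrees of freedom are too large and its leverage values omit the contribution of the panel effects. Diagnostics computed this way approximate those of the fixed-effects fit rather than reproduce them.

```python
from collections import defaultdict

def demean_by_panel(ids, values):
    """Subtract each panel's own mean from its observations."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i, v in zip(ids, values):
        totals[i] += v
        counts[i] += 1
    return [v - totals[i] / counts[i] for i, v in zip(ids, values)]

# Made-up panel: y = panel effect (10, 20, 30) + 2*x,
# with 6 added to the fifth observation to create a poorly fitted value.
ids = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x = [1, 2, 3, 2, 4, 6, 1, 3, 5]
y = [12, 14, 16, 24, 34, 32, 32, 36, 40]

dx = demean_by_panel(ids, x)
dy = demean_by_panel(ids, y)

# Demeaned variables have mean zero, so the fitted intercept is zero
# and the slope is the within (fixed-effects) estimate.
slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
resid = [b - slope * a for a, b in zip(dx, dy)]

# Index of the observation with the largest absolute residual.
worst = max(range(len(resid)), key=lambda k: abs(resid[k]))
print("within slope:", slope)
print("largest residual at observation index:", worst)
```

The residuals from the demeaned regression are the same as those from the fixed-effects fit, so they are a reasonable first place to look for poorly fitted observations before turning to diagnostics that depend on leverage.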