Re: st: xtreg check for outliers


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: xtreg check for outliers
Date   Thu, 9 Aug 2012 14:48:47 +0100

Somewhat in the spirit of David's comment, but from a broader perspective:

I think it's important to recognise that the one word "outliers"
covers some quite different situations, some of which are not even
problems. Indeed, one definition of outliers is that they surprise the
researcher, so being an outlier is as much psychology as ontology.

A complete taxonomy is necessarily elusive, and there is at least one
lengthy monograph on outliers. But minimally we should distinguish:

1. Outliers that are essentially mistakes, as they represent
impossible or at least implausible values. These can arise from
equipment malfunction, contamination of samples, human
misunderstanding, lies, careless recording of data, clashes in
convention, inconsistencies in measurement units, etc. Thus -999 for
age is evidently a missing data code, if not a joke by data entry
people. If people are still on the lookout for such outliers when
doing their modelling, it is a sign that they don't know enough or are
not zealous enough about data management, including data quality
checking. Sometimes there is scope for re-measurement, sometimes a
rough value can be estimated in other ways, but often such values just
have to be excluded from the data being analysed.

2. Outliers that are genuine, require care in handling but can be
accommodated by using an appropriate transformed scale for analysis.
As a geographer, I find the canonical example is the Amazon, which on
most river measures really is big! Perhaps I am lucky but it has been
my experience that most such outliers can be accommodated by either
transformation or using a suitable link function, either explicitly
(e.g. -glm-) or tacitly (e.g. -poisson-). Logarithms are your friend.

3. Outliers that are genuine but seem to be awkward for or destructive
to any model fit tried and which the analyst is tempted to exclude
from the data, or model ad hoc. A weak or inexperienced analyst yields
to the temptation; a strong analyst knows several ways of including
the outlier with various tricks, including devising new models. To me,
the best rationale for exclusion is a substantive or scientific
argument making it clear why the outlier really doesn't belong (it's a
goat that doesn't belong with these sheep) and excluding outliers just
because they make life statistically difficult is less convincing.
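
Points 1 and 2 above can be sketched in a few lines. This is only an
illustration in Python, not Stata code and not anything from the
original post: the data values are made up, the set of missing-data
codes is an assumption (real codes come from the codebook), and the
helper names recode_missing and robust_distance are hypothetical. The
sketch first turns a sentinel such as -999 into a missing value, then
compares how far the largest genuine value sits from the median on the
raw scale and on the log scale.

```python
import math
from statistics import median

# Assumed missing-data codes; real codes depend on the data set's codebook.
MISSING_CODES = {-999}


def recode_missing(values, codes=MISSING_CODES):
    """Replace sentinel codes such as -999 with None (point 1)."""
    return [None if v in codes else v for v in values]


def robust_distance(values):
    """Distance of the largest value from the median, in units of the
    median absolute deviation (MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return (max(values) - med) / mad


# Made-up positive measurements: six ordinary values, one genuine giant,
# and one -999 that is a missing-data code rather than a measurement.
raw = [10, 20, 40, 80, 160, 320, 200000, -999]

clean = [v for v in recode_missing(raw) if v is not None]
logged = [math.log(v) for v in clean]   # point 2: analyse on a log scale

print("raw scale:", round(robust_distance(clean), 1))
print("log scale:", round(robust_distance(logged), 1))
```

With these invented numbers the largest value is thousands of MADs
from the median on the raw scale and only a handful on the log scale,
which is the sense in which a transformed scale can accommodate a
genuine outlier.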

Naturally, much more could be said. A purely personal aside is that I
don't think that nonparametric statistics or robust statistics are
quite as helpful in practice in dealing with outliers as their most
energetic advocates would have you believe.

Nick

On Wed, Aug 8, 2012 at 1:37 PM, David Hoaglin <dchoaglin@gmail.com> wrote:
> Dalhia,
>
> In multiple regression, "outliers" can take a variety of forms.
>
> An observation may have an unusual combination of values of the
> predictor variables.  Such points are influential.  If the model fits
> well there, the corresponding value of y may not be an outlier.
> Cook's distance, DFFITS, and DFBETAS help to diagnose various aspects
> of influence.
>
> Studentized residuals can show whether the model fits poorly at an
> individual observation (in effect, whether that value of y is an
> outlier, relative to the model).
>
> The variety of possibilities can make diagnosis of "outliers" challenging.

On Wed, Aug 8, 2012 at 7:03 AM, Dalhia <ggs_da@yahoo.com> wrote:

>> How do I check for outliers when using xtreg, fe? One
>> solution I thought of was to demean each variable for each panel, and
>> then rerun using regress, and then use the cook's d, dfits, avplot etc.
>> to identify outliers. Is this a reasonable solution? Is there a
>> different/better way to do this?
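
The procedure described in the question (demean within each panel, fit
by ordinary least squares, then compute Cook's distance) can be
sketched as follows. This is a minimal illustration in Python rather
than Stata, with one predictor, made-up data, and hypothetical function
names; it shows the mechanics only and is not an endorsement of the
approach by anyone in the thread. Two assumptions are worth flagging:
the residual degrees of freedom subtract one per panel for the means
removed, and the leverage used is that of the demeaned regression, which
omits the contribution the panel dummies would make in the
dummy-variable form of the model.

```python
from collections import defaultdict


def demean_by_panel(panel_ids, values):
    """Subtract each panel's own mean (the within transformation)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(panel_ids, values):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for g, v in zip(panel_ids, values)]


def cooks_distance_within(panel_ids, x, y):
    """Slope and Cook's distances from a one-predictor demeaned OLS fit."""
    xd = demean_by_panel(panel_ids, x)
    yd = demean_by_panel(panel_ids, y)
    sxx = sum(a * a for a in xd)
    beta = sum(a * b for a, b in zip(xd, yd)) / sxx
    resid = [b - beta * a for a, b in zip(xd, yd)]
    n, n_panels, k = len(x), len(set(panel_ids)), 1
    # Assumption: residual df also loses one per panel mean removed.
    s2 = sum(e * e for e in resid) / (n - n_panels - k)
    # Leverage in the demeaned, no-constant regression only.
    lev = [a * a / sxx for a in xd]
    cooks = [e * e * h / (k * s2 * (1 - h) ** 2)
             for e, h in zip(resid, lev)]
    return beta, cooks


# Made-up panel data: three panels of four observations each.
# The final y value has been deliberately pushed far from its neighbours.
ids = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
x = [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 5, 7]
y = [5.1, 6.9, 9.2, 10.8, 14.0, 16.1, 17.9, 20.0, 1.9, 6.2, 9.8, 30.0]

beta, cooks = cooks_distance_within(ids, x, y)
worst = max(range(len(cooks)), key=cooks.__getitem__)
print("slope:", round(beta, 3))
print("largest Cook's distance at observation index:", worst)
```

A large Cook's distance here only says that the observation moves the
fitted slope; whether it is a mistake, a case for transformation, or a
genuine awkward value is the separate judgement discussed above.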