I would add one point to Nick's laundry list: an outlier
is a surprising result, and it is often surprising only
because we have used a particular model. Thinking about why
we obtained the surprise can sometimes lead to a different
model without any outliers.
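A small illustration of that point, sketched in Python rather
than Stata. The data are made up, and the 1.5 x IQR fence is only
one common convention for flagging points, chosen here as an
assumption. The same values are screened on the raw scale and on
a log scale:

```python
import math
from statistics import quantiles

def flagged(values, k=1.5):
    """Return the values lying beyond k * IQR from the quartiles.

    The fence rule and the default k=1.5 are assumptions made for
    this illustration, not a recommendation.
    """
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" method
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]

# Hypothetical data that grow geometrically.
data = [1, 2, 4, 8, 16, 32, 64, 128, 256]

raw_flags = flagged(data)                          # raw scale
log_flags = flagged([math.log2(v) for v in data])  # log scale

print(raw_flags)  # [256]
print(log_flags)  # []
```

On the raw scale the largest value is flagged; on the log scale
nothing is. The value did not change, only the model of what
counts as unsurprising.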
Rich
Nick Cox wrote:
Sure, there is a -winsor- ado on SSC, which I wrote
and which, according to Kit Baum's reports, is quite heavily
used. I have never used it myself, except in development.
I cannot recall the details, but perhaps someone
wrote into Statalist reporting that it seemed that
Stata did not support Winsorizing and that was a black
mark against Stata. To which the best reply was a
program, being concrete evidence that you can easily do
Winsorizing in Stata and here is one way to do it.
But let us look at the wider picture. There is no
one way to deal with outliers. There are many ways
to deal with outliers, including
1. Going out "into the field" and doing the measurement
again.
2. Testing whether they are genuine. Most of the
tests look pretty contrived to me, but you might find one
that you can believe fits your situation. Irrational
faith that a test is appropriate is always needed
to apply a test that is then presented as quintessentially
rational.
3. Throwing them out as a matter of judgement, i.e.
in Stata terms -drop-ping them from the data.
4. Throwing them out using some more-or-less
automated (usually not "objective") rule.
5. Ignoring them, along the lines of either 3 or 4.
This could be formal (e.g. trimming) or just leaving
them in the dataset, but omitting them from analyses
as too hot to handle.
6. Pulling them in using some kind of adjustment,
e.g. Winsorizing.
7. Downplaying them by using some other robust estimation
method.
8. Downplaying them by working on a transformed
scale.
9. Downplaying them by using a non-identity link
function.
10. Accommodating them by fitting some appropriate
fat-, long-, or heavy-tailed distribution, without
or with predictors.
11. Sidestepping the issue by using some non-parametric
(e.g. rank-based) procedure.
12. Getting a handle on the implied uncertainty
using bootstrapping, jackknifing or permutation-based
procedures.
13. Editing to replace an outlier with some more
likely value, based on deterministic logic. "An 18-
year-old grandmother is unlikely, but the person
in question was born in 1926, so presumably is
really 81."
14. Editing to replace an impossible or implausible
outlier using some imputation method that is currently
acceptable not-quite-white magic.
15. Analysing with and without, and seeing how much
difference the outlier(s) make(s), statistically,
scientifically or practically.
16. Something Bayesian. My prior ignorance of quite
what forbids me from giving any details.
Naturally, these categories intergrade in some
cases, and I can believe I have forgotten
or am not aware of yet other approaches.
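Some of the approaches above are easy to compare side by side.
The following is a minimal sketch in Python, not Stata, using
made-up data: it reports the mean with and without a suspect
value (approach 15), a trimmed mean (approach 5) and the median
as one simple resistant summary (in the spirit of approaches 7
and 11). The function names and the choice of summaries are
assumptions for illustration only:

```python
from statistics import mean, median

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    if k < 0 or 2 * k >= len(ordered):
        raise ValueError("k must satisfy 0 <= 2k < n")
    return mean(ordered[k:len(ordered) - k])

def compare(values, suspects, k=1):
    """Summaries with and without the values named in `suspects`."""
    kept = [v for v in values if v not in suspects]
    return {
        "mean_with": mean(values),
        "mean_without": mean(kept),
        "trimmed_mean": trimmed_mean(values, k),
        "median_with": median(values),
        "median_without": median(kept),
    }

# Hypothetical data; 95 is the value under suspicion.
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 95]
summary = compare(data, suspects={95})
for name, value in summary.items():
    print(name, value)
```

With these data the mean moves from 14.9 to 6 when the suspect
value is omitted, while the median moves only from 6.5 to 6.
Whether a shift of that size matters is the scientific or
practical question that approach 15 asks.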
What is quite striking to me -- as with many
other areas of statistical science -- is how much
preferred solutions vary between investigators
and disciplines, despite the broad similarity
of the problems that outliers pose.
Nick
n.j.cox@durham.ac.uk