Re: st: Drop if beyond certain percentile

From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Drop if beyond certain percentile
Date   Fri, 30 Mar 2012 10:59:17 +0100

I can't let this pass by without delivering a comment that this is in
general a highly dubious way to analyse your data. There may well be
an "in particular" in which it makes much more sense, but you don't
give it. (For example, if your purpose was to show that this is a
lousy way, then we agree.)

To repeat a posting of mine from 27 August 2010:

In 1827 Olbers asked Gauss "What should count as an unusual or too
large a deviation? I would like to receive more precise directions."
Gauss in his reply was disinclined to give any more directions and
compared the situation to everyday life, where one often has to make
intuitive judgments outside the reign of formal and explicit rules.

This is a paraphrase of a paraphrase from Gigerenzer, G. and five
friends. 1989. The empire of chance: How probability changed science
and everyday life. Cambridge University Press, p.83, who give the
reference to Olbers' Leben und Werke.

We have had lots of smart suggestions from smart people since 1827,
but that still seems to me the best concise advice about outliers or
problematic tails: distrust the urge to ask for formal solutions.

Rather than -drop-ping, a much better strategy is to flag subsets and
see how much difference it makes:

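* percentiles are left behind in r(p5) and r(p95)
su y, detail
* 1 if y lies within [p5, p95], 0 otherwise (including missing y)
gen touse = inrange(y, r(p5), r(p95))

<analysis> if touse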
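
A minimal sketch of that comparison, assuming Stata's shipped auto
dataset, with price standing in for y and a simple regression standing
in for <analysis>. The dataset, the variables and the model are
illustrative assumptions only, not anything from the original question.

```stata
* Illustrative only: auto.dta, price and the regression are stand-ins
sysuse auto, clear

* Percentiles are left behind in r(p5) and r(p95)
summarize price, detail
generate byte touse = inrange(price, r(p5), r(p95))

* How many observations would trimming set aside?
tabulate touse

* Same model on all the data and on the flagged subset
regress mpg weight price
estimates store all
regress mpg weight price if touse
estimates store trimmed

* Side-by-side coefficients and standard errors
estimates table all trimmed, b(%9.4f) se(%9.4f) stats(N r2)
```

Because nothing is dropped, the full data remain available and the
same flag can be reused for any later command.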

A more complete discussion would be at least book length, but what I
take to be fairly standard advice follows.

1. It is easy to be overimpressed by irregularities in the marginal
distribution of any variable. Few analysis commands depend on exactly
what that distribution is.

2. It is best if decisions about outliers are informed by scientific
or practical knowledge about the measurement process and possible or
plausible values.

3. Transformation of a variable, robust methods and non-identity link
functions are all standard methods that tend to reduce possibly malign
influences from outliers, although there is much lively discussion
about their relative merit.

4. Trimming (to give this method its usual name) was a popular method
for getting e.g. robust estimates of the general level of a variable
in the 1960s and 1970s, some fraction of its appeal being that it is
very easy to explain and easy to calculate. But it was not even clear
quite how best to extend the idea to calculating measures of spread,
and making an entire analysis contingent on trimming extremes of a
single variable is a different game altogether.
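
The alternatives named in point 3 might be sketched as follows, again
assuming the auto dataset with price as the response and weight as the
predictor. These are illustrative stand-ins, and which approach (if
any) is appropriate depends on the data and the science behind them.

```stata
* Illustrative only: auto.dta and these variables are stand-ins
sysuse auto, clear

* Baseline: ordinary least squares on the raw scale
regress price weight

* Transformation: model the logarithm of the response
generate double lnprice = ln(price)
regress lnprice weight

* Robust methods: robust regression and median regression
rreg price weight
qreg price weight

* Non-identity link: generalized linear model with log link
glm price weight, family(gamma) link(log)
```

Each of these uses every observation, so no cutoff percentile has to
be chosen or defended.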

To put it pragmatically, downstream of this is presumably a
presentation, a paper, a thesis, whatever, and there is a high
probability that some fraction of your audience or appraisers will
think this a bad idea and that most if not all of the others would
still expect at a minimum some assessment of how much difference it
makes.


On Fri, Mar 30, 2012 at 6:12 AM, Sandy Y. Zhu <[email protected]> wrote:

> Would anyone happen to know how to drop datapoints that are larger
> than or smaller than certain percentiles? For example, I would like to
> drop any observations that are higher than the 95th percentile or
> lower than the 5th percentile in my dataset.