


Re: st: Drop if beyond certain percentile

From   Nick Cox <>
Subject   Re: st: Drop if beyond certain percentile
Date   Fri, 30 Mar 2012 11:00:56 +0100

Typo time:

su y, detail

On Fri, Mar 30, 2012 at 10:59 AM, Nick Cox <> wrote:
> I can't let this pass by without delivering a comment that this is in
> general a highly dubious way to analyse your data. There may well be
> an "in particular" in which it makes much more sense, but you don't
> give it. (For example, if your purpose was to show that this is a
> lousy way, then we agree.)
> To repeat a posting of mine from 27 August 2010:
> In 1827 Olbers asked Gauss "What should count as an unusual or too
> large a deviation? I would like to receive more precise directions."
> Gauss in his reply was disinclined to give any more directions and
> compared the situation to everyday life, where one often has to make
> intuitive judgments outside the reign of formal and explicit rules.
> This is a paraphrase of a paraphrase from Gigerenzer, G. and five
> friends. 1989. The empire of chance: How probability changed science
> and everyday life. Cambridge University Press, p.83, who give the
> reference to Olbers' Leben und Werke.
> We have had lots of smart suggestions from smart people since 1827,
> but that still seems to me the best concise advice about outliers or
> problematic tails: distrust the urge to ask for formal solutions.
> Rather than -drop-ping, a much better strategy is to flag subsets and
> see how much difference it makes:
> su y, detail
> gen touse = inrange(y, r(p5), r(p95))
> <analysis>
> <analysis> if touse
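> A minimal sketch of that flag-and-compare idea, using -regress- with a
> hypothetical predictor x standing in for whatever analysis you intend:
>
> su y, detail
> gen touse = inrange(y, r(p5), r(p95))
> // fit on all observations, then on the flagged subset only
> regress y x
> regress y x if touse
>
> If the two sets of results are close, the tails are not driving your
> conclusions; if they differ markedly, that difference is itself worth
> reporting.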
> A more complete discussion would be at least book length, but what I
> take to be fairly standard advice follows.
> 1. It is easy to be overimpressed by irregularities in the marginal
> distribution of any variable. Few analysis commands depend on what it
> is exactly.
> 2. It is best if decisions about outliers are informed by scientific
> or practical knowledge about the measurement process and possible or
> plausible values.
> 3. Transformation of a variable, robust methods and non-identity link
> functions are all standard methods that tend to reduce possibly malign
> influences from outliers, although there is much lively discussion
> about their relative merit.
> 4. Trimming (to give this method its usual name) was a popular method
> for getting e.g. robust estimates of the general level of a variable
> in the 1960s and 1970s, some fraction of its appeal being that it is
> very easy to explain and easy to calculate. But it was not even clear
> quite how best to extend the idea to calculating measures of spread,
> and making an entire analysis contingent on trimming extremes of a
> single variable is a different game altogether.
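> To make point 3 concrete, some standard Stata alternatives (sketches
> only; y and x are hypothetical variables):
>
> gen ln_y = ln(y)    // transformation: analyse on the log scale
> rreg y x            // robust regression, downweighting large residuals
> qreg y x            // median regression, resistant to extreme y
> glm y x, family(gaussian) link(log)    // non-identity link function
>
> Each tempers the influence of extreme values without discarding any
> observations.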
> To put it pragmatically, downstream of this is presumably a
> presentation, a paper, a thesis, whatever, and there is a high
> probability that some fraction of your audience or appraisers will
> think this a bad idea and that most if not all of the others would
> still expect at a minimum some assessment of how much difference it
> makes.
> Nick
> On Fri, Mar 30, 2012 at 6:12 AM, Sandy Y.  Zhu <> wrote:
>> Would anyone happen to know how to drop datapoints that are larger
>> than or smaller than certain percentiles? For example, I would like to
>> drop any observations that are higher than the 95th percentile or
>> lower than the 5th percentile in my dataset.
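
For reference, the literal operation asked about can be sketched as
follows, assuming the variable is named y (but see the caveats in the
reply above before dropping anything):

su y, detail
drop if !inrange(y, r(p5), r(p95))   // note: also drops observations with missing y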

