Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Drop if beyond certain percentile
Date: Fri, 30 Mar 2012 10:59:17 +0100

I can't let this pass by without delivering a comment that this is in general a highly dubious way to analyse your data. There may well be an "in particular" in which it makes much more sense, but you don't give it. (For example, if your purpose was to show that this is a lousy way, then we agree.)

To repeat a posting of mine from 27 August 2010:

In 1827 Olbers asked Gauss "What should count as an unusual or too large a deviation? I would like to receive more precise directions." Gauss in his reply was disinclined to give any more directions and compared the situation to everyday life, where one often has to make intuitive judgments outside the reign of formal and explicit rules.

This is a paraphrase of a paraphrase from Gigerenzer, G. and five friends. 1989. The empire of chance: How probability changed science and everyday life. Cambridge University Press, p.83, who give the reference to Olbers' Leben und Werke.

We have had lots of smart suggestions from smart people since 1827, but that still seems to me the best concise advice about outliers or problematic tails: distrust the urge to ask for formal solutions.

Rather than -drop-ping, a much better strategy is to flag subsets and see how much difference it makes:

su y, detail
gen touse = inrange(y, r(p5), r(p95))
<analysis>
<analysis> if touse

A more complete discussion would be at least book length, but what I take to be fairly standard advice follows.

1. It is easy to be overimpressed by irregularities in the marginal distribution of any variable. Few analysis commands depend on what it is exactly.

2. It is best if decisions about outliers are informed by scientific or practical knowledge about the measurement process and possible or plausible values.

3. Transformation of a variable, robust methods and non-identity link functions are all standard methods that tend to reduce possibly malign influences from outliers, although there is much lively discussion about their relative merit.

4. Trimming (to give this method its usual name) was a popular method for getting e.g. robust estimates of the general level of a variable in the 1960s and 1970s, some fraction of its appeal being that it is very easy to explain and easy to calculate. But it was not even clear quite how best to extend the idea to calculating measures of spread, and making an entire analysis contingent on trimming extremes of a single variable is a different game altogether.

To put it pragmatically, downstream of this is presumably a presentation, a paper, a thesis, whatever, and there is a high probability that some fraction of your audience or appraisers will think this a bad idea and that most if not all of the others would still expect at a minimum some assessment of how much difference it makes.

Nick

On Fri, Mar 30, 2012 at 6:12 AM, Sandy Y. Zhu <[email protected]> wrote:

> Would anyone happen to know how to drop datapoints that are larger
> than or smaller than certain percentiles? For example, I would like to
> drop any observations that's higher than 95% percentile or lower than
> 5% percentile in my dataset.
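
The flag-and-compare strategy suggested in the reply might look something like the following in practice. This is only a sketch: the auto dataset shipped with Stata, the variables price and weight, the choice of -regress- as the analysis, and the names full and trimmed are all stand-ins chosen for illustration, not anything taken from the original question.

```stata
* Illustrative sketch only: dataset, variables and model are assumptions
sysuse auto, clear

* -summarize, detail- leaves the percentiles in r(p5) and r(p95)
summarize price, detail

* Flag, rather than drop, observations between the 5th and 95th percentiles
generate byte touse = inrange(price, r(p5), r(p95))

* How many observations would trimming exclude?
tabulate touse

* Fit the same model to all observations and to the flagged subset
regress price weight
estimates store full

regress price weight if touse
estimates store trimmed

* Coefficients, standard errors and sample sizes side by side
estimates table full trimmed, b(%9.3f) se(%9.3f) stats(N)
```

Because nothing is dropped, both sets of results remain available, and the final table gives the kind of "how much difference does it make" assessment the reply argues an audience would expect.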

