From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Drop if beyond certain percentile
Date: Fri, 30 Mar 2012 11:00:56 +0100
Typo time: su y, detail

On Fri, Mar 30, 2012 at 10:59 AM, Nick Cox <njcoxstata@gmail.com> wrote:

> I can't let this pass by without delivering a comment that this is in
> general a highly dubious way to analyse your data. There may well be
> an "in particular" in which it makes much more sense, but you don't
> give it. (For example, if your purpose was to show that this is a
> lousy way, then we agree.)
>
> To repeat a posting of mine from 27 August 2010:
>
> In 1827 Olbers asked Gauss "What should count as an unusual or too
> large a deviation? I would like to receive more precise directions."
> Gauss in his reply was disinclined to give any more directions and
> compared the situation to everyday life, where one often has to make
> intuitive judgments outside the reign of formal and explicit rules.
>
> This is a paraphrase of a paraphrase from Gigerenzer, G. and five
> friends. 1989. The empire of chance: How probability changed science
> and everyday life. Cambridge University Press, p.83, who give the
> reference to Olbers' Leben und Werke.
>
> We have had lots of smart suggestions from smart people since 1827,
> but that still seems to me the best concise advice about outliers or
> problematic tails: distrust the urge to ask for formal solutions.
>
> Rather than -drop-ping, a much better strategy is to flag subsets and
> see how much difference it makes:
>
> su y detail
> gen touse = inrange(y, r(p5), r(p95))
>
> <analysis>
>
> <analysis> if touse
>
> A more complete discussion would be at least book length, but what I
> take to be fairly standard advice follows.
>
> 1. It is easy to be overimpressed by irregularities in the marginal
> distribution of any variable. Few analysis commands depend on what it
> is exactly.
>
> 2. It is best if decisions about outliers are informed by scientific
> or practical knowledge about the measurement process and possible or
> plausible values.
>
> 3. Transformation of a variable, robust methods and non-identity link
> functions are all standard methods that tend to reduce possibly malign
> influences from outliers, although there is much lively discussion
> about their relative merits.
>
> 4. Trimming (to give this method its usual name) was a popular method
> for getting e.g. robust estimates of the general level of a variable
> in the 1960s and 1970s, some fraction of its appeal being that it is
> very easy to explain and easy to calculate. But it was never quite
> clear how best to extend the idea to calculating measures of spread,
> and making an entire analysis contingent on trimming the extremes of a
> single variable is a different game altogether.
>
> To put it pragmatically, downstream of this is presumably a
> presentation, a paper, a thesis, whatever, and there is a high
> probability that some fraction of your audience or appraisers will
> think this a bad idea, and that most if not all of the others would
> still expect at a minimum some assessment of how much difference it
> makes.
>
> Nick
>
> On Fri, Mar 30, 2012 at 6:12 AM, Sandy Y. Zhu <sandy.zhu@yale.edu> wrote:
>
>> Would anyone happen to know how to drop data points that are larger
>> than or smaller than certain percentiles? For example, I would like to
>> drop any observation that is higher than the 95th percentile or lower
>> than the 5th percentile in my dataset.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
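To make the flag-and-compare strategy above concrete, here is a minimal Stata sketch. The dataset (the shipped auto.dta), the variable price standing in for y, and the regressions used as the downstream analysis are all assumptions chosen for illustration; only the -summarize, detail- percentiles and the inrange() flag come from the thread itself.

* illustration only: auto.dta, price, and the regression model are assumed
sysuse auto, clear

* -summarize, detail- leaves the percentiles in r(); store the cutoffs
summarize price, detail
local lo = r(p5)
local hi = r(p95)

* flag, rather than drop, observations inside the 5th-95th percentile range
generate byte touse = inrange(price, `lo', `hi')

* run the analysis with and without the tails to see how much difference it makes
regress mpg price weight
regress mpg price weight if touse

* the drop the original question asked about (generally discouraged):
* drop if !inrange(price, `lo', `hi')

Comparing the two regressions shows directly how sensitive the results are to the trimmed tails, which is the minimum assessment an audience would expect according to the advice above.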