

From: Nick Cox <njcoxstata@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: How to find extreme values

Date: Tue, 27 Mar 2012 13:02:34 +0100

The help for -extremes- (SSC) does not show an option to generate an indicator variable and I can confirm that there is no such undocumented option.

This was, and is, entirely deliberate on my part as the author of -extremes-. I don't want to (seem to) encourage people to create an indicator, which is then likely to be used in dubious decisions. The point of -extremes- is to encourage _thinking_ about extremes.

There are people who feel the want or need to automate decisions about outliers, and they are naturally welcome to write their own programs in their own way.

Nick

On Tue, Mar 27, 2012 at 12:30 PM, Nakelse, Tebila (AfricaRice) <T.Nakelse@cgiar.org> wrote:

> Thank you Nick,
> The correct multiplier I had in mind is 1.5*iqr , as it is set in -extremes- as default, and not 1.25*iqr.
> Anyway, -extremes- is very suitable to list the extremes value. But I don't know if -extremes- can help to create a variable to identify the extreme value in the dataset.
> I guess it would be useful because of the fact that one can use to correct the extreme value (I mean Replacement with a given estimate) or exclude the extreme value from an estimation ( a regression for example).

Nick Cox

> See also -extremes- from SSC. For example,
>
> . sysuse auto
> (1978 Automobile Data)
>
> . extremes price
>
>   +--------------+
>   | obs:   price |
>   |--------------|
>   |  34.   3,291 |
>   |  14.   3,299 |
>   |  18.   3,667 |
>   |  68.   3,748 |
>   |  66.   3,798 |
>   +--------------+
>
>   +---------------+
>   |  64.   12,990 |
>   |  28.   13,466 |
>   |  27.   13,594 |
>   |  12.   14,500 |
>   |  13.   15,906 |
>   +---------------+
>
> . extremes price, iqr sep(0)
>
>   +-----------------------+
>   | obs:    iqr:    price |
>   |-----------------------|
>   |  53.   1.559    9,690 |
>   |  55.   1.580    9,735 |
>   |  41.   1.877   10,371 |
>   |   9.   1.877   10,372 |
>   |  11.   2.349   11,385 |
>   |  26.   2.401   11,497 |
>   |  74.   2.633   11,995 |
>   |  64.   3.096   12,990 |
>   |  28.   3.318   13,466 |
>   |  27.   3.378   13,594 |
>   |  12.   3.800   14,500 |
>   |  13.   4.455   15,906 |
>   +-----------------------+
>
> The table shows that on the usual boxplot rule there are 12 outliers in -price-, not 8. Tebila missed some overlapping symbols. -graph box price, marker(1, ms(Oh))- makes the fact of overlap easier to see. As I understand it -graph box- does not support jittering (at this moment I am using an ancient Stata).
>
> Also, an outlier being at least 1.25 * iqr away from the nearer quartile is Tebila's convention, but it is not that used by -graph box-. -extremes- allows you to set your own multiplier.
>
> A further note of caution: Suppose you reject outliers as being more than so many "deviations" away from some reference "level", then recalculate (the definitions of quoted terms being up to you). You might need to iterate more than once, as new outliers could be identified at each stage.
>
> All that said, I would rather identify outliers than delete them, as also Maarten advocated.
>
> Nick
>
> On Tue, Mar 27, 2012 at 9:19 AM, Nakelse, Tebila (AfricaRice) <T.Nakelse@cgiar.org> wrote:
>
>> Find below an example of identification and correction of extreme value .
>>
>> *** correction of the variable price
>>
>> sysuse auto, clear
>>
>> /* plot to visualize the extreme*/
>> graph box price
>>
>> /* we can distinguish 8 extremes values*/
>>
>> *** quartiles of price
>>
>> egen Q1_price= pctile(price), p(25)
>> egen Q3_price= pctile(price), p(75)
>> egen IC_price= iqr(price)
>>
>> ***Identification of extreme value
>>
>> gen touse=1 if (price< Q1_price-1.25*IC_price| price> Q3_price+1.25*IC_price) & missing(price)==0
>> recode touse . =0
>>
>> tab touse
>>
>> ***Correction of the price
>> gen pricec =price
>> replace pricec =Q1_price-1.25*IC_price if price < Q1_price-1.25*IC_price & touse==1
>> replace pricec =Q3_price+1.25*IC_price if price> Q3_price+1.25*IC_price & touse==1
>>
>> /* the corrected price box plot to see if the extreme value remain*/
>> *graph box pricec
>
> Maarten Buis
>>
>> On Tue, Mar 27, 2012 at 5:24 AM, Barth Riley <barthriley@comcast.net> wrote:
>>> To remove outliers, you could:
>>>
>>> preserve
>>> replace var = . if abs(var) >= 1000000 (or some other value)
>>> [perform analyses]
>>> restore
>>>
>>> preserve and restore are added if you want to make a temporary change
>>> to these values
>>
>> If I were to exclude such observations I would probably do something like:
>>
>> gen byte touse = abs(var) <= 1e6
>> reg y var x if touse
>>
>> -reg- could be any command, the key is the -if touse- part. The variable touse will contain 0s and 1s such that those non-extreme values get 1 (true) and the extreme values get 0 (false), see:
>> <http://www.stata.com/support/faqs/data/trueorfalse.html>. The reason why I prefer this is that it does not destroy any information in my dataset.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
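For readers who decide they do want a flag despite the caution in this thread, here is a minimal sketch of the "write your own" route. It is not part of -extremes-, and it assumes the quartiles from -summarize, detail- and the 1.5 multiplier of the usual boxplot rule. The names mult, iqr, lower, upper and outside are made up for this illustration. The flag only identifies observations, in the spirit of Maarten's -if touse- approach, and nothing is deleted or overwritten.

```stata
* Sketch only: flag values outside the boxplot fences without changing the data.
* mult, iqr, lower, upper and outside are hypothetical names chosen for this example.
sysuse auto, clear

quietly summarize price, detail
local mult  = 1.5                        // assumed multiplier (usual boxplot rule)
local iqr   = r(p75) - r(p25)
local lower = r(p25) - `mult' * `iqr'
local upper = r(p75) + `mult' * `iqr'

* 1 = outside the fences, 0 = inside, missing if price is missing
generate byte outside = (price < `lower' | price > `upper') if !missing(price)

* Look at the flagged observations before deciding anything
count if outside == 1
list make price if outside == 1, sep(0)

* Compare fits with and without the flagged observations
regress mpg price weight
regress mpg price weight if outside == 0
```

With the auto data, the count should agree with the 12 values that -extremes price, iqr- lists above. Changing the local mult gives a different convention, and, as noted in the thread, recomputing the fences after excluding values can flag new observations at each pass.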

