Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: How to find extreme values

| Field | Value |
|---|---|
| From | Nick Cox <[email protected]> |
| To | [email protected] |
| Subject | Re: st: How to find extreme values |
| Date | Tue, 27 Mar 2012 09:37:58 +0100 |

```
See also -extremes- from SSC. For example,

. sysuse auto
(1978 Automobile Data)

. extremes price

+--------------+
| obs:   price |
|--------------|
|  34.   3,291 |
|  14.   3,299 |
|  18.   3,667 |
|  68.   3,748 |
|  66.   3,798 |
+--------------+

+---------------+
|  64.   12,990 |
|  28.   13,466 |
|  27.   13,594 |
|  12.   14,500 |
|  13.   15,906 |
+---------------+

. extremes price, iqr sep(0)

+-----------------------+
| obs:    iqr:    price |
|-----------------------|
|  53.   1.559    9,690 |
|  55.   1.580    9,735 |
|  41.   1.877   10,371 |
|   9.   1.877   10,372 |
|  11.   2.349   11,385 |
|  26.   2.401   11,497 |
|  74.   2.633   11,995 |
|  64.   3.096   12,990 |
|  28.   3.318   13,466 |
|  27.   3.378   13,594 |
|  12.   3.800   14,500 |
|  13.   4.455   15,906 |
+-----------------------+

The table shows that on the usual boxplot rule there are 12 outliers
in -price-, not 8. Tebila missed some overlapping symbols. -graph box
price, marker(1, ms(Oh))- makes the fact of overlap easier to see. As
I understand it -graph box- does not support jittering (at this moment
I am using an ancient Stata).

Also, an outlier being at least 1.25 * iqr away from the nearer
quartile is Tebila's convention, but it is not that used by -graph
box-, which uses 1.5 * iqr. -extremes- allows you to set your own
multiplier.
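
* A minimal sketch of that rule worked by hand, assuming the 1.5 * iqr
* multiplier. The names -mult-, -lower-, -upper- and -outside- are
* chosen here only for illustration; set -mult- to 1.25 for Tebila's
* convention.

sysuse auto, clear
quietly summarize price, detail
local mult 1.5
local lower = r(p25) - `mult' * (r(p75) - r(p25))
local upper = r(p75) + `mult' * (r(p75) - r(p25))
* 1 = beyond either fence, 0 = within, missing if price is missing
gen byte outside = (price < `lower' | price > `upper') if !missing(price)
* with mult = 1.5 this should count the 12 prices listed above
count if outside == 1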

A further note of caution: Suppose you reject outliers as being more
than so many "deviations" away from some reference "level", then
recalculate (the definitions of quoted terms being up to you). You
might need to iterate more than once, as new outliers could be
identified at each stage.
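
* A rough sketch of such iteration, assuming (purely for illustration)
* that "level" means the nearer quartile and "deviation" means
* 1.5 * iqr, both recalculated on the observations still retained.
* The names -ok-, -pass- and -changed- are hypothetical. Nothing is
* deleted; -ok- only records which observations survive each pass.

sysuse auto, clear
gen byte ok = !missing(price)
local pass 0
local changed 1
while `changed' {
    quietly summarize price if ok, detail
    local lower = r(p25) - 1.5 * (r(p75) - r(p25))
    local upper = r(p75) + 1.5 * (r(p75) - r(p25))
    * how many retained observations now fall beyond the fences?
    quietly count if ok & (price < `lower' | price > `upper')
    local changed = r(N)
    quietly replace ok = 0 if ok & (price < `lower' | price > `upper')
    local ++pass
    display "pass `pass': `changed' newly flagged"
}
* the loop stops at the first pass that flags nothing new
tab ok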

All that said, I would rather identify outliers than delete them, as

Nick

On Tue, Mar 27, 2012 at 9:19 AM, Nakelse, Tebila (AfricaRice)
<[email protected]> wrote:

> Find below an example of identifying and correcting extreme values.
>
> *** correction of  the variable price
>
> sysuse auto, clear
>
> /* plot to visualize the extreme*/
> graph box price
>
> /* we can distinguish 8 extreme values */
>
> ***  quartiles of  price
>
> egen Q1_price= pctile(price), p(25)
> egen Q3_price= pctile(price), p(75)
> egen  IC_price= iqr(price)
>
> ***Identification of extreme value
>
> gen touse=1 if (price< Q1_price-1.25*IC_price| price> Q3_price+1.25*IC_price) & missing(price)==0
> recode touse . =0
>
> tab touse
>
> ***Correction of the price
> gen pricec =price
> replace pricec  =Q1_price-1.25*IC_price if price < Q1_price-1.25*IC_price & touse==1
> replace pricec =Q3_price+1.25*IC_price if price> Q3_price+1.25*IC_price &  touse==1
>
> /* box plot of the corrected price, to see whether any extreme values remain */
> *graph  box pricec

Maarten Buis
>
> On Tue, Mar 27, 2012 at 5:24 AM, Barth Riley <[email protected]> wrote:
>> To remove outliers, you could:
>>
>> preserve
>> replace var = . if abs(var) >= 1000000 (or some other value)
>> [perform analyses]
>> restore
>>
>> preserve and restore are added if you want to make a temporary change
>> to these values
>
> If I were to exclude such observations I would probably do something like:
>
> gen byte touse = abs(var) <= 1e6
> reg y var x if touse
>
> -reg- could be any command, the key is the -if touse- part. The variable touse will contain 0s and 1s such that those non-extreme values get 1 (true) and the extreme values get 0 (false), see:
> <http://www.stata.com/support/faqs/data/trueorfalse.html>. The reason why I prefer this is that it does not destroy any information in my dataset.
>

```
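
The two threads above (an IQR-based rule and an indicator used with -if-) can be combined. What follows is only a sketch under stated assumptions: the 1.5 * iqr multiplier, the auto data, and a regression of mpg on weight and price are picked for illustration and are not taken from any of the posts.

```stata
sysuse auto, clear

* fences from the quartiles of price (1.5 is an assumed multiplier)
quietly summarize price, detail
local lower = r(p25) - 1.5 * (r(p75) - r(p25))
local upper = r(p75) + 1.5 * (r(p75) - r(p25))

* touse is 1 within the fences and 0 beyond them;
* inrange() also returns 0 when price is missing
gen byte touse = inrange(price, `lower', `upper')
tab touse

* same model with and without the flagged observations
regress mpg weight price if touse
regress mpg weight price
```

Because the data are never changed, the two fits can be compared directly, and the flagged observations stay available for listing or plotting.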