Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: Re: RE: RE: IQR


From   "Rodrigo A. Alfaro" <[email protected]>
To   <[email protected]>
Subject   st: Re: RE: RE: IQR
Date   Wed, 6 Jun 2007 19:43:16 -0400

///
Wow Nick, your translation from 'demotic' can be only compared with the work of Thomas Young. Just kidding, very good job indeed!! Returning to the problem, it would be nice to get a list of return scalars in your new version. For the problem, the limits were the observations are supposed to be outliers can be used after for sample selection or to create new variables.

Alternative to the procedure discussed so far, there is another way to 'deal' with the outlier (if you want to), which is cutting the tails "we trimmed firms whose total assets growth rate exceed the 90th percentile or fall short of the 10th percentile of the annual distribution." page 6 of Baum, Caglayan, Ozkan (2003), Working Paper 566, Boston College. For example, tdavis could use the following code to 'drop' the outliers that are above of 5th and 95th percentile of each variable:

foreach x of varlist price total_assets inventories {
gen double `x'_wo = `x'
sum `x', d
local u = r(p95)
local l = r(p5)
replace `x'_wo = . if `x'>`u'
replace `x'_wo = . if `x'<`l'
}

Rodrigo.



----- Original Message ----- From: "Nick Cox" <[email protected]>
To: <[email protected]>
Sent: Wednesday, June 06, 2007 6:53 PM
Subject: st: RE: RE: IQR



I spent a while updating -iqr- to -iqr8-.

This was unnecessary, because -iqr- works
fine under version control. (How many programs
would run without change in other software after 16 years?)
Nevertheless, few Stata users will now be accustomed to reading
or writing Stata like this:

---------------------------- iqr (L. Hamilton)
*Outliers and summary statistics for simple normality check.
program define iqr
version 2.1
capture drop tempvar
mac def _varlist "required existing max(1)"
mac def _if "optional"
mac def _in "optional"
parse "%_*"
quietly generate tempvar=%_1 %_if %_in
quietly summ tempvar, detail
macro define NN=_result(1)
display
#delimit ;
di in gr "   mean= " in ye %6.0g _result(3) in gr _col(20)
     "       std.dev.= " in ye %6.0g sqrt(_result(4))
     in gr _col(50) "   (n= " in ye _result(1) in gr ")";
di in gr " median= " in ye %6.0g _result(10)
     in gr _col(20) "pseudo std.dev.= "
     in ye %6.0g (_result(11)-_result(9))/1.349
     in gr _col(50) " (IQR= " in ye %6.0g _result(11)-_result(9)
     in gr ")";
mac define _lowfen1=_result(9)-1.5*(_result(11)-_result(9));
mac define _hifen1=_result(11)+1.5*(_result(11)-_result(9));
mac define _lowfen2=_result(9)-3*(_result(11)-_result(9));
mac define _hifen2= _result(11)+3*(_result(11)-_result(9));
quietly summ tempvar if tempvar>_result(8) & tempvar<_result(12);
di in gr "10 trim= " in ye %6.0g _result(3);
di in gr _col(48) "low " _col(60) "high";
di in gr _col(48) "-------------------";
di in gr _col(28) "     inner fences " in ye _col(48) %6.0g %_lowfen1
     _col(60) %6.0g %_hifen1;
quietly count if tempvar<%_lowfen1 & tempvar>=%_lowfen2 & tempvar~=.;
macro define _loout1=_result(1);
quietly count if tempvar>%_hifen1 & tempvar<=%_hifen2 & tempvar~=.;
macro define _hiout1=_result(1);
di in gr _col(28) "# mild outliers   " in ye _col(48) %_loout1
     _col(60) %_hiout1;
di in gr _col(28) "% mild outliers   "
     in ye _col(48) %4.2f 100*%_loout1/%NN in gr "%"
     in ye _col(60) %4.2f 100*%_hiout1/%NN in gr "%";
display;
di in gr _col(28) "     outer fences " in ye _col(48) %6.0g %_lowfen2
     _col(60) %6.0g %_hifen2;
quietly count if tempvar<%_lowfen2 & tempvar~=.;
macro define _loout2=_result(1);
quietly count if tempvar>%_hifen2 & tempvar~=.;
macro define _hiout2=_result(1);
di in gr _col(28) "# severe outliers " in ye _col(48) %_loout2
     _col(60) %_hiout2;
di in gr _col(28) "% severe outliers "
     in ye _col(48) %4.2f 100*%_loout2/%NN in gr "%"
     in ye _col(60) %4.2f 100*%_hiout2/%NN in gr "%";
display;
capture drop tempvar;
#delimit cr
end
---------------------

This is not a literal translation. I changed some small
details as a matter of taste. In 16 years, there are also
some changes in how Stata programmers would typically do things.

--------------------- iqr8
*! NJC 1.0.0 6 June 2007
* Stata 8 rendering of -iqr- from STB-3 sed4 1991 for Stata 2.1
* Outliers and summary statistics for simple normality check
program iqr8
       version 8
       syntax varname(numeric) [if] [in]
       marksample touse
       local v "`varlist'"

       quietly su `v' if `touse', detail
       local N = r(N)

       di _n as txt "   mean = "           as res %8.0g r(mean)  ///
             as txt "{col 29}std dev. = " as res %8.0g r(sd)     ///
             as txt "{col 53}n = "  as res %8.0f r(N)

       di as txt " median = " as res %8.0g r(p50)                ///
          as txt "{col 22}pseudo std dev. = "                    ///
          as res %8.0g (r(p75) - r(p25)) / 1.349                 ///
          as txt "{col 51}IQR = " as res %8.0g r(p75) - r(p25)

       tempname lowfen1 hifen1 lowfen2 hifen2
       scalar `lowfen1' = r(p25) - 1.5 * (r(p75) - r(p25))
       scalar `hifen1'  = r(p75) + 1.5 * (r(p75) - r(p25))
       scalar `lowfen2' = r(p25) - 3 * (r(p75) - r(p25))
       scalar `hifen2'  = r(p75) + 3 * (r(p75) - r(p25))

       su `v' if `v' > r(p10) & `v' < r(p90) & `touse', meanonly

       di as txt "10 trim = " as res %8.0g r(mean)
       di as txt "{col 48}low" "{col 62}high"
       di as txt "{col 48}{hline 22}"
       di as txt "{col 33}inner fences " ///
          as res "{col 48}" %8.0g `lowfen1' "{col 62}" %8.0g `hifen1'

       quietly count if `v' < `lowfen1' & `v' >= `lowfen2' & `touse'
       local loout1 = r(N)
       quietly count if `v' > `hifen1' & `v' <= `hifen2' & `touse'
       local hiout1 = r(N)

       di as txt "{col 30}# mild outliers"          ///
          as res "{col 48}`loout1'" "{col 62}`hiout1'"

       di as txt "{col 30}% mild outliers"          ///
          as res "{col 48}" %4.2f 100 * `loout1' / `N' ///
          as res "{col 62}" %4.2f 100 * `hiout1' / `N'

       di _n as txt "{col 33}outer fences"     ///
             as res "{col 48}" %8.0g `lowfen2' "{col 62}" %8.0g `hifen2'

       quietly count if `v' < `lowfen2' & `touse'
       local loout2 = r(N)
       quietly count if `v' > `hifen2' & `touse'
       local hiout2 = r(N)

       di as txt "{col 28}# severe outliers"             ///
          as res "{col 48}`loout2'" "{col 62}`hiout2'"
       di as txt "{col 28}% severe outliers"             ///
          as res "{col 48}" %4.2f 100 * `loout2' / `N'   ///
          as res "{col 62}" %4.2f 100 * `hiout2' / `N'
end
-----------------------

Nick
[email protected]

-----Original Message-----
From: [email protected]
[mailto:[email protected]]On Behalf Of Nick Cox
Sent: 06 June 2007 23:19
To: [email protected]
Subject: st: RE: IQR


I think you are referring to -iqr- from STB-3. The FAQ
advises

"Say what command(s) you are using. If they are not part of
official Stata, say where they come from: the STB/SJ, SSC, or
other archives."

As you did not follow this advice, I had to puzzle out
which command you were referring to. No doubt various
other members of the list fell at the first fence.

Now as to your question: I do not understand what you
do not understand. The help for -iqr- looks very helpful
to me. It includes these definitions:

============================================================
 IQR (Interquartile Range) = 75th percentile - 25th percentile
      Pseudo standard deviation = IQR/1.349
      10% trim mean             = Average of cases between 10th and
                                     90th percentiles
      Inner fences              = Q(25)-1.5IQR and Q(75)+1.5IQR
      Outer fences              = Q(25)-3IQR   and Q(75)+3IQR
      Mild outlier              = Q(25)-3IQR <= x < Q(25)-1.5IQR  or
                                  Q(75)+1.5IQR < x <= Q(75)+3IQR
      Severe outlier            = x < Q(25)-3IQR  or  x > Q(75)+3IQR
=============================================================

Thus a "severe outlier" lies more than 3 IQR away from the nearer
quartile and a "mild outlier" lies more than 1.5 (but not
more than 3)
IQR away from the nearer quartile.

These definitions go back to J.W. Tukey. 1977. Exploratory data
analysis. Reading, MA: Addison-Wesley, except that the definitions
of quartiles Stata uses are documented at [R] summarize.

These are arbitrary limits. Their main interest is that they are
sometimes used in boxplots to determine which data points should
be shown individually.

That said, "getting rid" of severe outliers is, in my view, not
usually a good idea unless there is independent evidence that
the data are wholly untrustworthy (e.g. a laboratory record that
the experiment was grossly disturbed). Dropping values more than
3 IQR away from the nearer quartile will in most instances throw
out important information. It would throw away most major cities
compared with cities in their country.

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index