# Re: st: RE: Outlier: Detection

 From "Sergiy Radyakin" To statalist@hsphsun2.harvard.edu Subject Re: st: RE: Outlier: Detection Date Tue, 19 Feb 2008 16:32:38 -0500

```Hello Badri,

I don't know any of the above two procedures, but here is the one I
wrote for detecting outlying codes in the categorical variables (e.g.
99 in 1,2,3,99). If this is what you are looking for. It returns
r(outliers) if any are found.

All comments, corrections, critique are welcomed. It would be
important to know if it makes sense, since an important routine is
using it.

// ============= BEGIN GRUBBS.ADO =======================

program Grubbs, rclass

/* Sergiy RADYAKIN, 2007
First argument -- variable name, second argument -- significance level alpha
Returns: macro outliers = list of outliers */

quietly inspect `1'
local n=r(N_unique)
if `n'>90 exit

local t=invttail(`n'-2,`2'/(2*`n'))                         // t-value
local t2=`t'*`t'
local G_cr=((`n'-1)/sqrt(`n'))*sqrt(`t2'/(`n'-2+`t2'))      //
Critical value of the Grubb's test

quietly levelsof `1', local(levs)                           // Get
the list of the values

local levsum=0
local sqsum=0
foreach lev of local levs {
local levsum=`levsum'+`lev'          // Sum of values
local sqsum=`sqsum'+`lev'*`lev'      // Sum of squares of values
}

local mean=`levsum'/`n'
local levsdev=sqrt(`sqsum'/`n'-`mean'*`mean')

local outliers
foreach lev of local levs {
local Z=abs(`mean'-`lev')/`levsdev'
if `Z'>`G_cr' local outliers "`outliers' `lev'"
}

return local outliers="`outliers'"
end // program Grubbs

// ================== END OF GRUBBS.ADO =================

Best regards,

> > Hi stata users,
> >
> > I am trying to run a stata program to detecet outlier in my data set. I found 2 grubbs programs written in stata. Programs are here:
> >
> > Program # 1.
> > _______________________________program begins_____________________________________________
> **************************************
> * This is grubbs.ado beta version
> * Date: Jan, 20,2007
> * Version: 1.1
> *
> * Questions, comments and bug reports :
> * couderc@univ-paris1.fr
> *
> * Version history :
> * v1.1: - Bug correction (odd behavior when -if- is specified).
> *       - Changes in the default variable names (grubbs_xx2, ...)
> * v1.0: Initial release.
> * Initial code by A.-C. Disdier and K. Head (available at http://strategy.sauder.ubc.ca/head/grubbs.ado)
> **************************************
>
> set more off
> cap prog drop grubbs
> program grubbs
> version 8.0
>
> syntax [varlist(default=none)] [if] [in], [GENerate(string)] [DRop] [LOg] [ITer(integer 16000)] [LEVel(real 95.0)]
>
> ********************
> * Verifying syntax
> ********************
> if "`varlist'"=="" {
>        di as error "varlist required"
>        exit 198
> }
> if `level'>100 | `level'<1 {
>        di as error "level() specifies the confidence level, as a percentage. It must be between 1.0 and 100.0"
>        exit
> }
> scalar conf=(100-`level')/100
> if `iter'<0 {
>        di as error "iter() must be an integer above 0"
>        exit
> }
> if "`drop'"!="" & "`generate'"!="" {
>        di as error "drop skipped because of generate()"
>        local drop=""
> }
> marksample touse
>
> ********************
> * Grubbs procedure
> ********************
>
> scalar nbvar=wordcount("`varlist'")
> scalar nbnewvar=wordcount("`generate'")
> if nbnewvar!=nbvar & nbnewvar!=0 {
>        di as error "Number of variable names in generate() not equal to number of var, skip to default names"
>        local generate=""
> }
> tokenize `"`generate'"'
> foreach var of local varlist {
>        di as result "Variable: `var' " _continue
>        tempvar centred varmq
>        local varname="grubbs_`var'"
>        if ("`generate'"!="") local varname="`1'"
>        capture confirm new var `varname'
>        local tempvarname="`varname'"
>        local i=1
>        while _rc==110 {
>                local varname="`tempvarname'`i'"
>                local i=`i'+1
>                capture confirm new var `varname'
>        }
>        di "(0/1 variable recording which observations are outliers: `varname')."
>        gen byte `varname'=0
>        local i = 1
>        gen byte `varmq' =(`var'==. | `touse'!=1)
>        scalar cutoff = 10
>        scalar G = cutoff +1
>        while G > cutoff & `i'<= `iter' {
>                qui sum `var' if `varname' == 0 & `touse'
>                gen `centred' = (abs(`var' -r(mean)))/r(sd)
>                gsort -`varmq' -`varname' `centred'
>                scalar cutoff = (r(N)-1)*sqrt(invttail(r(N)-2,conf/(2*r(N)))^2/(r(N)*(r(N)-2+invttail(r(N)-2,conf/(2*r(N)))^2)))
>                scalar G = `centred'[_N]
>                if ("`log'"!="" & G > cutoff) {
>                        di as txt "Iteration = " `i' ". T-value: " %5.4f G " so " `var'[_N] " is an outlier"
>                }
>                qui replace `varname'= 1 if `centred' == G & G > cutoff
>                local i = `i'+1
>                drop `centred'
>        }
>        local j=`i'-2
>        if (`i'<=`iter') di as result "`j' outliers. No more outliers"
>        else di as error "more than `j' outliers. Increase number of iterations"
>        drop `varmq'
>        mac shift
> }
> if "`drop'"!="" {
>        tempvar del
>        egen `del'=rowtotal(grubbs_*)
>        drop if `del'!=0
>        drop `del'
>        capture drop grubbs_*
> }
> end
> > ____________________________________end of program__________________________________
> >
> >
> > Program #2.
> > __________________________________________program begins___________________________
> > program define grubbs
> > * this is a revised version of the original command, it no longer deletes missing obs or outlier
> > * instead, it sets "tag_grubbs" =1 if it believes the obs to be an outlier
> > * usage: "grubbs myvar .05 10"
> > version 8.0
> > *arguments:
> > * 1= Name of variable
> > * 2= Confidence interval (0.05 or 0.01)
> > * 3= Max number of iterations
> > args xvar conf maxit
> > tempvar dev missx
> > gen byte tag_grubbs = 0>
> > local i = 1
> > di "deleting missing values"
> > gen byte missx =  `xvar'==.
> > * initial guess for critical value
> > scalar Gcrit = 10
> > * start with G > Gcrit (otherwise loop will not begin)
> > scalar G = Gcrit +1
> > di "maxit = " `maxit'
> > di "G= " G
> > di "Gcrit = " Gcrit
> >          while G > Gcrit & `i'<= `maxit' {
> >                    sum `xvar' if tag_grubbs == 0
> >                    local nobs = r(N)
> >                    gen `dev' = (abs(`xvar' -r(mean)))/r(sd)
> >                    gsort -`missx' -tag_grubbs `dev'
> >                    scalar G = `dev'[_N]
> >                    local ct = `conf'/(2*`nobs')
> >                    local ts = invttail(`nobs'-2,`ct')
> >                    scalar Gcrit = (`nobs'-1)*sqrt(`ts'^2/(`nobs'*(`nobs'-2+`ts'^2)))
> >                    di "Iteration = " `i' "   Critical G = " Gcrit " Current G = " G
> >                    if (G > Gcrit) di `xvar'[_N] " is an outlier, so tag_grubbs = 1"
> >                    replace tag_grubbs = 1 if  `dev' == G & G > Gcrit
> >                    local i = `i'+1
> >                    drop `dev'
> > }
> >
> > if (`i'<=`maxit') di "Grubbs procedure terminated: no more outliers"
> > else di "Maximum iterations exceeded: Use larger maxit"
> > end
> > ___________________________________end of program__________________________________________
> >
> > Could anyone suggest me which program is better to use. I will appreciate if you please use auto data for variable price as an example to run these programs.
> >
> > Thanks.
> >
> > Policy, Reporting and Data Development
> > Labour Standards and Workplace Equity
> > National Labour Operations Directorate
> > HRSDC
> > (819) 956 - 8146
> >
>
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```