Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: RE: Outlier: Detection


From   "Sergiy Radyakin" <[email protected]>
To   [email protected]
Subject   Re: st: RE: Outlier: Detection
Date   Tue, 19 Feb 2008 16:32:38 -0500

Hello Badri,

I don't know any of the above two procedures, but here is the one I
wrote for detecting outlying codes in the categorical variables (e.g.
99 in 1,2,3,99). If this is what you are looking for. It returns
r(outliers) if any are found.

All comments, corrections, critique are welcomed. It would be
important to know if it makes sense, since an important routine is
using it.

// ============= BEGIN GRUBBS.ADO =======================

program Grubbs, rclass

/* Sergiy RADYAKIN, 2007
   First argument -- variable name, second argument -- significance level alpha
   Returns: macro outliers = list of outliers */

	quietly inspect `1'
	local n=r(N_unique)
	if `n'>90 exit

	local t=invttail(`n'-2,`2'/(2*`n'))                         // t-value
	local t2=`t'*`t'
	local G_cr=((`n'-1)/sqrt(`n'))*sqrt(`t2'/(`n'-2+`t2'))      //
Critical value of the Grubb's test

	quietly levelsof `1', local(levs)                           // Get
the list of the values

	local levsum=0
	local sqsum=0
	foreach lev of local levs {
	   local levsum=`levsum'+`lev'          // Sum of values
	   local sqsum=`sqsum'+`lev'*`lev'      // Sum of squares of values
	}

	local mean=`levsum'/`n'
	local levsdev=sqrt(`sqsum'/`n'-`mean'*`mean')

	local outliers
	foreach lev of local levs {
		local Z=abs(`mean'-`lev')/`levsdev'
		if `Z'>`G_cr' local outliers "`outliers' `lev'"
	}

	return local outliers="`outliers'"
end // program Grubbs

// ================== END OF GRUBBS.ADO =================


Best regards,
   Sergiy Radyakin





On 2/19/08, [email protected]
<[email protected]> wrote:
> > Hi stata users,
> >
> > I am trying to run a stata program to detecet outlier in my data set. I found 2 grubbs programs written in stata. Programs are here:
> >
> > Program # 1.
> > _______________________________program begins_____________________________________________
> **************************************
> * This is grubbs.ado beta version
> * Date: Jan, 20,2007
> * Version: 1.1
> *
> * Questions, comments and bug reports :
> * [email protected]
> *
> * Version history :
> * v1.1: - Bug correction (odd behavior when -if- is specified).
> *       - Changes in the default variable names (grubbs_xx2, ...)
> * v1.0: Initial release.
> * Initial code by A.-C. Disdier and K. Head (available at http://strategy.sauder.ubc.ca/head/grubbs.ado)
> **************************************
>
> set more off
> cap prog drop grubbs
> program grubbs
> version 8.0
>
> syntax [varlist(default=none)] [if] [in], [GENerate(string)] [DRop] [LOg] [ITer(integer 16000)] [LEVel(real 95.0)]
>
> ********************
> * Verifying syntax
> ********************
> if "`varlist'"=="" {
>        di as error "varlist required"
>        exit 198
> }
> if `level'>100 | `level'<1 {
>        di as error "level() specifies the confidence level, as a percentage. It must be between 1.0 and 100.0"
>        exit
> }
> scalar conf=(100-`level')/100
> if `iter'<0 {
>        di as error "iter() must be an integer above 0"
>        exit
> }
> if "`drop'"!="" & "`generate'"!="" {
>        di as error "drop skipped because of generate()"
>        local drop=""
> }
> marksample touse
>
> ********************
> * Grubbs procedure
> ********************
>
> scalar nbvar=wordcount("`varlist'")
> scalar nbnewvar=wordcount("`generate'")
> if nbnewvar!=nbvar & nbnewvar!=0 {
>        di as error "Number of variable names in generate() not equal to number of var, skip to default names"
>        local generate=""
> }
> tokenize `"`generate'"'
> foreach var of local varlist {
>        di as result "Variable: `var' " _continue
>        tempvar centred varmq
>        local varname="grubbs_`var'"
>        if ("`generate'"!="") local varname="`1'"
>        capture confirm new var `varname'
>        local tempvarname="`varname'"
>        local i=1
>        while _rc==110 {
>                local varname="`tempvarname'`i'"
>                local i=`i'+1
>                capture confirm new var `varname'
>        }
>        di "(0/1 variable recording which observations are outliers: `varname')."
>        gen byte `varname'=0
>        local i = 1
>        gen byte `varmq' =(`var'==. | `touse'!=1)
>        scalar cutoff = 10
>        scalar G = cutoff +1
>        while G > cutoff & `i'<= `iter' {
>                qui sum `var' if `varname' == 0 & `touse'
>                gen `centred' = (abs(`var' -r(mean)))/r(sd)
>                gsort -`varmq' -`varname' `centred'
>                scalar cutoff = (r(N)-1)*sqrt(invttail(r(N)-2,conf/(2*r(N)))^2/(r(N)*(r(N)-2+invttail(r(N)-2,conf/(2*r(N)))^2)))
>                scalar G = `centred'[_N]
>                if ("`log'"!="" & G > cutoff) {
>                        di as txt "Iteration = " `i' ". T-value: " %5.4f G " so " `var'[_N] " is an outlier"
>                }
>                qui replace `varname'= 1 if `centred' == G & G > cutoff
>                local i = `i'+1
>                drop `centred'
>        }
>        local j=`i'-2
>        if (`i'<=`iter') di as result "`j' outliers. No more outliers"
>        else di as error "more than `j' outliers. Increase number of iterations"
>        drop `varmq'
>        mac shift
> }
> if "`drop'"!="" {
>        tempvar del
>        egen `del'=rowtotal(grubbs_*)
>        drop if `del'!=0
>        drop `del'
>        capture drop grubbs_*
> }
> end
> > ____________________________________end of program__________________________________
> >
> >
> > Program #2.
> > __________________________________________program begins___________________________
> > program define grubbs
> > * this is a revised version of the original command, it no longer deletes missing obs or outlier
> > * instead, it sets "tag_grubbs" =1 if it believes the obs to be an outlier
> > * usage: "grubbs myvar .05 10"
> > version 8.0
> > *arguments:
> > * 1= Name of variable
> > * 2= Confidence interval (0.05 or 0.01)
> > * 3= Max number of iterations
> > args xvar conf maxit
> > tempvar dev missx
> > gen byte tag_grubbs = 0>
> > local i = 1
> > di "deleting missing values"
> > gen byte missx =  `xvar'==.
> > * initial guess for critical value
> > scalar Gcrit = 10
> > * start with G > Gcrit (otherwise loop will not begin)
> > scalar G = Gcrit +1
> > di "maxit = " `maxit'
> > di "G= " G
> > di "Gcrit = " Gcrit
> >          while G > Gcrit & `i'<= `maxit' {
> >                    sum `xvar' if tag_grubbs == 0
> >                    local nobs = r(N)
> >                    gen `dev' = (abs(`xvar' -r(mean)))/r(sd)
> >                    gsort -`missx' -tag_grubbs `dev'
> >                    scalar G = `dev'[_N]
> >                    local ct = `conf'/(2*`nobs')
> >                    local ts = invttail(`nobs'-2,`ct')
> >                    scalar Gcrit = (`nobs'-1)*sqrt(`ts'^2/(`nobs'*(`nobs'-2+`ts'^2)))
> >                    di "Iteration = " `i' "   Critical G = " Gcrit " Current G = " G
> >                    if (G > Gcrit) di `xvar'[_N] " is an outlier, so tag_grubbs = 1"
> >                    replace tag_grubbs = 1 if  `dev' == G & G > Gcrit
> >                    local i = `i'+1
> >                    drop `dev'
> > }
> >
> > if (`i'<=`maxit') di "Grubbs procedure terminated: no more outliers"
> > else di "Maximum iterations exceeded: Use larger maxit"
> > end
> > ___________________________________end of program__________________________________________
> >
> > Could anyone suggest me which program is better to use. I will appreciate if you please use auto data for variable price as an example to run these programs.
> >
> > Thanks.
> >
> > Badri Prasad
> > Policy, Reporting and Data Development
> > Labour Standards and Workplace Equity
> > National Labour Operations Directorate
> > HRSDC
> > (819) 956 - 8146
> >
>
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index