Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: RE: Outlier: Detection


From   <badri.prasad@hrsdc-rhdsc.gc.ca>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Outlier: Detection
Date   Tue, 19 Feb 2008 15:17:16 -0500

> Hi stata users,
> 
> I am trying to run a stata program to detecet outlier in my data set. I found 2 grubbs programs written in stata. Programs are here:
> 
> Program # 1.
> _______________________________program begins_____________________________________________
**************************************
* This is grubbs.ado beta version
* Date: Jan, 20,2007
* Version: 1.1
* 
* Questions, comments and bug reports : 
* couderc@univ-paris1.fr
*
* Version history :
* v1.1: - Bug correction (odd behavior when -if- is specified). 
*       - Changes in the default variable names (grubbs_xx2, ...)
* v1.0: Initial release. 
* Initial code by A.-C. Disdier and K. Head (available at http://strategy.sauder.ubc.ca/head/grubbs.ado)
**************************************

set more off
cap prog drop grubbs
program grubbs
version 8.0

syntax [varlist(default=none)] [if] [in], [GENerate(string)] [DRop] [LOg] [ITer(integer 16000)] [LEVel(real 95.0)] 

********************
* Verifying syntax
********************
if "`varlist'"=="" {
	di as error "varlist required"
	exit 198
}
if `level'>100 | `level'<1 {
	di as error "level() specifies the confidence level, as a percentage. It must be between 1.0 and 100.0"
	exit
}
scalar conf=(100-`level')/100
if `iter'<0 {
	di as error "iter() must be an integer above 0"
	exit
}
if "`drop'"!="" & "`generate'"!="" {
	di as error "drop skipped because of generate()"
	local drop=""
}
marksample touse

********************
* Grubbs procedure
********************

scalar nbvar=wordcount("`varlist'")
scalar nbnewvar=wordcount("`generate'")
if nbnewvar!=nbvar & nbnewvar!=0 {
	di as error "Number of variable names in generate() not equal to number of var, skip to default names"
	local generate=""
}
tokenize `"`generate'"'
foreach var of local varlist {
	di as result "Variable: `var' " _continue
	tempvar centred varmq
	local varname="grubbs_`var'"
	if ("`generate'"!="") local varname="`1'"
	capture confirm new var `varname'
	local tempvarname="`varname'"
	local i=1
	while _rc==110 {
		local varname="`tempvarname'`i'"
		local i=`i'+1
		capture confirm new var `varname'
	}
	di "(0/1 variable recording which observations are outliers: `varname')."
	gen byte `varname'=0
	local i = 1
	gen byte `varmq' =(`var'==. | `touse'!=1)
	scalar cutoff = 10
	scalar G = cutoff +1
        while G > cutoff & `i'<= `iter' {
		qui sum `var' if `varname' == 0 & `touse'
		gen `centred' = (abs(`var' -r(mean)))/r(sd)
		gsort -`varmq' -`varname' `centred'
		scalar cutoff = (r(N)-1)*sqrt(invttail(r(N)-2,conf/(2*r(N)))^2/(r(N)*(r(N)-2+invttail(r(N)-2,conf/(2*r(N)))^2)))
		scalar G = `centred'[_N]
		if ("`log'"!="" & G > cutoff) {
			di as txt "Iteration = " `i' ". T-value: " %5.4f G " so " `var'[_N] " is an outlier"
		}
		qui replace `varname'= 1 if `centred' == G & G > cutoff
		local i = `i'+1
		drop `centred' 
	}
	local j=`i'-2
	if (`i'<=`iter') di as result "`j' outliers. No more outliers"
	else di as error "more than `j' outliers. Increase number of iterations"
	drop `varmq'
	mac shift
}
if "`drop'"!="" {
	tempvar del
	egen `del'=rowtotal(grubbs_*)
	drop if `del'!=0
	drop `del' 
	capture drop grubbs_*
}
end
> ____________________________________end of program__________________________________
> 
> 
> Program #2.
> __________________________________________program begins___________________________
> program define grubbs
> * this is a revised version of the original command, it no longer deletes missing obs or outlier
> * instead, it sets "tag_grubbs" =1 if it believes the obs to be an outlier
> * usage: "grubbs myvar .05 10"
> version 8.0
> *arguments:
> * 1= Name of variable
> * 2= Confidence interval (0.05 or 0.01)
> * 3= Max number of iterations
> args xvar conf maxit
> tempvar dev missx
> gen byte tag_grubbs = 0> 
> local i = 1
> di "deleting missing values"
> gen byte missx =  `xvar'==.
> * initial guess for critical value
> scalar Gcrit = 10
> * start with G > Gcrit (otherwise loop will not begin)
> scalar G = Gcrit +1
> di "maxit = " `maxit'
> di "G= " G
> di "Gcrit = " Gcrit
>          while G > Gcrit & `i'<= `maxit' {
>                    sum `xvar' if tag_grubbs == 0
>                    local nobs = r(N)
>                    gen `dev' = (abs(`xvar' -r(mean)))/r(sd)
>                    gsort -`missx' -tag_grubbs `dev'
>                    scalar G = `dev'[_N]
>                    local ct = `conf'/(2*`nobs')
>                    local ts = invttail(`nobs'-2,`ct')
>                    scalar Gcrit = (`nobs'-1)*sqrt(`ts'^2/(`nobs'*(`nobs'-2+`ts'^2)))
>                    di "Iteration = " `i' "   Critical G = " Gcrit " Current G = " G
>                    if (G > Gcrit) di `xvar'[_N] " is an outlier, so tag_grubbs = 1"
>                    replace tag_grubbs = 1 if  `dev' == G & G > Gcrit
>                    local i = `i'+1
>                    drop `dev'
> }
> 
> if (`i'<=`maxit') di "Grubbs procedure terminated: no more outliers"
> else di "Maximum iterations exceeded: Use larger maxit"
> end
> ___________________________________end of program__________________________________________
> 
> Could anyone suggest me which program is better to use. I will appreciate if you please use auto data for variable price as an example to run these programs.
> 
> Thanks.
> 
> Badri Prasad
> Policy, Reporting and Data Development
> Labour Standards and Workplace Equity
> National Labour Operations Directorate
> HRSDC
> (819) 956 - 8146
> 

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index