Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: RE: Outlier: Detection


From   "Sergiy Radyakin" <serjradyakin@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: RE: Outlier: Detection
Date   Wed, 20 Feb 2008 11:32:01 -0500

The word "Critical" is supposed to be a comment "Critical value...".
Statalist's engine wraps the lines. I guess it was not intended for
sharing code, but rather simple messages.
I have changed //comments to /* comments */.

Note that this code does not automatically drop the outliers (as was
Maarten's concern). It is still your decision what you do with them.
The purpose of the code was alerting the programmer about the abnormal
values in categorical variables, e.g. codes like 97, 98, 997 etc, used
to represent certain missing values, not applicables, no infos, etc.

Regards, Sergiy


// ============= BEGIN GRUBBS.ADO =======================

program Grubbs, rclass

/* Sergiy RADYAKIN, 2007
  First argument -- variable name, second argument -- significance level alpha
  Returns: macro outliers = list of outliers */

       quietly inspect `1'
       local n=r(N_unique)
       if `n'>90 exit

       local t=invttail(`n'-2,`2'/(2*`n'))                         /* t-value */
       local t2=`t'*`t'
       local G_cr=((`n'-1)/sqrt(`n'))*sqrt(`t2'/(`n'-2+`t2'))      /*
Critical value of the Grubb's test */

       quietly levelsof `1', local(levs)       /* Get the list of the values */

       local levsum=0
       local sqsum=0
       foreach lev of local levs {
          local levsum=`levsum'+`lev'          /* Sum of values */
          local sqsum=`sqsum'+`lev'*`lev'      /* Sum of squares of values */
       }

       local mean=`levsum'/`n'
       local levsdev=sqrt(`sqsum'/`n'-`mean'*`mean')

       local outliers
       foreach lev of local levs {
               local Z=abs(`mean'-`lev')/`levsdev'
               if `Z'>`G_cr' local outliers "`outliers' `lev'"
       }

       return local outliers="`outliers'"
end /* program Grubbs */

// ================== END OF GRUBBS.ADO =================


On 2/20/08, badri.prasad@hrsdc-rhdsc.gc.ca
<badri.prasad@hrsdc-rhdsc.gc.ca> wrote:
> Hi Sergiy,
>
> I run your program with auto data using variable mpg as example (nothing to worry for criticacl values exits or not) but I got the following result.
>
> . Grubbs mpg .05
> unrecognized command:  Critical
> r(199);
>
> Could you please let me know what is wrong with it. Also, could you please let me know how to use your program using mpg variable with output (result) you got.
>
> With Regards,
>
> Badri Prasad
> Policy, Reporting and Data Development
> Labour Standards and Workplace Equity
> National Labour Operations Directorate
> HRSDC
> (819) 956 - 8146
>
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sergiy Radyakin
> Sent: 2008-02-19 4:33 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: RE: Outlier: Detection
>
>
> Hello Badri,
>
> I don't know any of the above two procedures, but here is the one I
> wrote for detecting outlying codes in the categorical variables (e.g.
> 99 in 1,2,3,99). If this is what you are looking for. It returns
> r(outliers) if any are found.
>
> All comments, corrections, critique are welcomed. It would be
> important to know if it makes sense, since an important routine is
> using it.
>
> // ============= BEGIN GRUBBS.ADO =======================
>
> program Grubbs, rclass
>
> /* Sergiy RADYAKIN, 2007
>   First argument -- variable name, second argument -- significance level alpha
>   Returns: macro outliers = list of outliers */
>
>        quietly inspect `1'
>        local n=r(N_unique)
>        if `n'>90 exit
>
>        local t=invttail(`n'-2,`2'/(2*`n'))                         // t-value
>        local t2=`t'*`t'
>        local G_cr=((`n'-1)/sqrt(`n'))*sqrt(`t2'/(`n'-2+`t2'))      //
> Critical value of the Grubb's test
>
>        quietly levelsof `1', local(levs)                           // Get
> the list of the values
>
>        local levsum=0
>        local sqsum=0
>        foreach lev of local levs {
>           local levsum=`levsum'+`lev'          // Sum of values
>           local sqsum=`sqsum'+`lev'*`lev'      // Sum of squares of values
>        }
>
>        local mean=`levsum'/`n'
>        local levsdev=sqrt(`sqsum'/`n'-`mean'*`mean')
>
>        local outliers
>        foreach lev of local levs {
>                local Z=abs(`mean'-`lev')/`levsdev'
>                if `Z'>`G_cr' local outliers "`outliers' `lev'"
>        }
>
>        return local outliers="`outliers'"
> end // program Grubbs
>
> // ================== END OF GRUBBS.ADO =================
>
>
> Best regards,
>   Sergiy Radyakin
>
>
>
>
>
> On 2/19/08, badri.prasad@hrsdc-rhdsc.gc.ca
> <badri.prasad@hrsdc-rhdsc.gc.ca> wrote:
> > > Hi stata users,
> > >
> > > I am trying to run a stata program to detecet outlier in my data set. I found 2 grubbs programs written in stata. Programs are here:
> > >
> > > Program # 1.
> > > _______________________________program begins_____________________________________________
> > **************************************
> > * This is grubbs.ado beta version
> > * Date: Jan, 20,2007
> > * Version: 1.1
> > *
> > * Questions, comments and bug reports :
> > * couderc@univ-paris1.fr
> > *
> > * Version history :
> > * v1.1: - Bug correction (odd behavior when -if- is specified).
> > *       - Changes in the default variable names (grubbs_xx2, ...)
> > * v1.0: Initial release.
> > * Initial code by A.-C. Disdier and K. Head (available at http://strategy.sauder.ubc.ca/head/grubbs.ado)
> > **************************************
> >
> > set more off
> > cap prog drop grubbs
> > program grubbs
> > version 8.0
> >
> > syntax [varlist(default=none)] [if] [in], [GENerate(string)] [DRop] [LOg] [ITer(integer 16000)] [LEVel(real 95.0)]
> >
> > ********************
> > * Verifying syntax
> > ********************
> > if "`varlist'"=="" {
> >        di as error "varlist required"
> >        exit 198
> > }
> > if `level'>100 | `level'<1 {
> >        di as error "level() specifies the confidence level, as a percentage. It must be between 1.0 and 100.0"
> >        exit
> > }
> > scalar conf=(100-`level')/100
> > if `iter'<0 {
> >        di as error "iter() must be an integer above 0"
> >        exit
> > }
> > if "`drop'"!="" & "`generate'"!="" {
> >        di as error "drop skipped because of generate()"
> >        local drop=""
> > }
> > marksample touse
> >
> > ********************
> > * Grubbs procedure
> > ********************
> >
> > scalar nbvar=wordcount("`varlist'")
> > scalar nbnewvar=wordcount("`generate'")
> > if nbnewvar!=nbvar & nbnewvar!=0 {
> >        di as error "Number of variable names in generate() not equal to number of var, skip to default names"
> >        local generate=""
> > }
> > tokenize `"`generate'"'
> > foreach var of local varlist {
> >        di as result "Variable: `var' " _continue
> >        tempvar centred varmq
> >        local varname="grubbs_`var'"
> >        if ("`generate'"!="") local varname="`1'"
> >        capture confirm new var `varname'
> >        local tempvarname="`varname'"
> >        local i=1
> >        while _rc==110 {
> >                local varname="`tempvarname'`i'"
> >                local i=`i'+1
> >                capture confirm new var `varname'
> >        }
> >        di "(0/1 variable recording which observations are outliers: `varname')."
> >        gen byte `varname'=0
> >        local i = 1
> >        gen byte `varmq' =(`var'==. | `touse'!=1)
> >        scalar cutoff = 10
> >        scalar G = cutoff +1
> >        while G > cutoff & `i'<= `iter' {
> >                qui sum `var' if `varname' == 0 & `touse'
> >                gen `centred' = (abs(`var' -r(mean)))/r(sd)
> >                gsort -`varmq' -`varname' `centred'
> >                scalar cutoff = (r(N)-1)*sqrt(invttail(r(N)-2,conf/(2*r(N)))^2/(r(N)*(r(N)-2+invttail(r(N)-2,conf/(2*r(N)))^2)))
> >                scalar G = `centred'[_N]
> >                if ("`log'"!="" & G > cutoff) {
> >                        di as txt "Iteration = " `i' ". T-value: " %5.4f G " so " `var'[_N] " is an outlier"
> >                }
> >                qui replace `varname'= 1 if `centred' == G & G > cutoff
> >                local i = `i'+1
> >                drop `centred'
> >        }
> >        local j=`i'-2
> >        if (`i'<=`iter') di as result "`j' outliers. No more outliers"
> >        else di as error "more than `j' outliers. Increase number of iterations"
> >        drop `varmq'
> >        mac shift
> > }
> > if "`drop'"!="" {
> >        tempvar del
> >        egen `del'=rowtotal(grubbs_*)
> >        drop if `del'!=0
> >        drop `del'
> >        capture drop grubbs_*
> > }
> > end
> > > ____________________________________end of program__________________________________
> > >
> > >
> > > Program #2.
> > > __________________________________________program begins___________________________
> > > program define grubbs
> > > * this is a revised version of the original command, it no longer deletes missing obs or outlier
> > > * instead, it sets "tag_grubbs" =1 if it believes the obs to be an outlier
> > > * usage: "grubbs myvar .05 10"
> > > version 8.0
> > > *arguments:
> > > * 1= Name of variable
> > > * 2= Confidence interval (0.05 or 0.01)
> > > * 3= Max number of iterations
> > > args xvar conf maxit
> > > tempvar dev missx
> > > gen byte tag_grubbs = 0>
> > > local i = 1
> > > di "deleting missing values"
> > > gen byte missx =  `xvar'==.
> > > * initial guess for critical value
> > > scalar Gcrit = 10
> > > * start with G > Gcrit (otherwise loop will not begin)
> > > scalar G = Gcrit +1
> > > di "maxit = " `maxit'
> > > di "G= " G
> > > di "Gcrit = " Gcrit
> > >          while G > Gcrit & `i'<= `maxit' {
> > >                    sum `xvar' if tag_grubbs == 0
> > >                    local nobs = r(N)
> > >                    gen `dev' = (abs(`xvar' -r(mean)))/r(sd)
> > >                    gsort -`missx' -tag_grubbs `dev'
> > >                    scalar G = `dev'[_N]
> > >                    local ct = `conf'/(2*`nobs')
> > >                    local ts = invttail(`nobs'-2,`ct')
> > >                    scalar Gcrit = (`nobs'-1)*sqrt(`ts'^2/(`nobs'*(`nobs'-2+`ts'^2)))
> > >                    di "Iteration = " `i' "   Critical G = " Gcrit " Current G = " G
> > >                    if (G > Gcrit) di `xvar'[_N] " is an outlier, so tag_grubbs = 1"
> > >                    replace tag_grubbs = 1 if  `dev' == G & G > Gcrit
> > >                    local i = `i'+1
> > >                    drop `dev'
> > > }
> > >
> > > if (`i'<=`maxit') di "Grubbs procedure terminated: no more outliers"
> > > else di "Maximum iterations exceeded: Use larger maxit"
> > > end
> > > ___________________________________end of program__________________________________________
> > >
> > > Could anyone suggest me which program is better to use. I will appreciate if you please use auto data for variable price as an example to run these programs.
> > >
> > > Thanks.
> > >
> > > Badri Prasad
> > > Policy, Reporting and Data Development
> > > Labour Standards and Workplace Equity
> > > National Labour Operations Directorate
> > > HRSDC
> > > (819) 956 - 8146
> > >
> >
> > *
> > *   For searches and help try:
> > *   http://www.stata.com/support/faqs/res/findit.html
> > *   http://www.stata.com/support/statalist/faq
> > *   http://www.ats.ucla.edu/stat/stata/
> >
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index