 Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: Extreme data points

| From | Jorge Eduardo Pérez Pérez |
| --- | --- |
| To | "statalist@hsphsun2.harvard.edu" |
| Subject | Re: st: Extreme data points |
| Date | Wed, 8 Jun 2011 11:47:09 -0400 |

```
You might also want to take a look at multivariate outlier detection
methods in Stata: -hadimvo- and -bacon-
_______________________
Jorge Eduardo Pérez Pérez

On Wed, Jun 8, 2011 at 10:23 AM, Austin Nichols <austinnichols@gmail.com> wrote:
> Achmed Aldai <Hauptseminar@gmx.de>:
> While that multivariate code works in the example with 2 vars, it
> falls down when more than 2 variables are put in X; this is better
> (bearing in mind that I cannot recommend the multivariate outlier
> detection approach for use in any real data, given the simulation
>
> clear all
> sysuse nlsw88
> loc X wage hours
> * univariate rule: flag each variable's bottom and top 0.5%
> * (nq(200) returns cutpoints r(r1) through r(r199))
> foreach v of loc X {
> _pctile `v', nq(200)
> g byte lo_`v'=(`v'<r(r1)|`v'>r(r199))
> }
> egen ex=rowtotal(lo_*)
> replace ex=1 if ex>1
> la var ex "Excluded values of X (univariate)"
> * multivariate rule: orthogonalize X by sequential regression and
> * standardize each residual (local 1 collects the t_ variables built so far)
> loc i 1
> loc 1
> qui foreach v of loc X {
> if `i'==1 {
> su `v'
> g double t_`v'=(`v'-r(mean))/r(sd)
> }
> else {
> reg `v' `1'
> predict double t_`v', res
> su t_`v'
> replace t_`v'=(t_`v'-r(mean))/r(sd)
> }
> loc 1 `1' t_`v'
> loc i=`i'+1
> }
> * squared distance = sum of squared standardized components; flag the top 1%
> qui foreach v of loc X {
> replace t_`v'=(t_`v')^2
> }
> egen double dm=rowtotal(t_*)
> _pctile dm, nq(100)
> g byte ex2=(dm>r(r99))
> la var ex2 "Excluded values of X (multivariate)"
> logit married wage hours if ex<1
> logit married wage hours if ex2<1
> sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)")) name(u)
> sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)"))
> ta ex ex2
>
> *but see also -mahapick- on SSC for a canned solution to calculating Mahalanobis distance
>
> On Wed, Jun 8, 2011 at 6:47 AM, Austin Nichols <austinnichols@gmail.com> wrote:
>> Achmed Aldai <Hauptseminar@gmx.de>:
>> See
>> http://www.stata.com/statalist/archive/2011-06/msg00240.html
>> and the rest of that thread; or read on for an improved multivariate
>> exclusion algorithm (the prior iteration ignored possible correlations
>> in X).
>>
>> The advisability of dropping extreme data points depends on what is
>> meant and why this is wanted; restricting a regression to a range of X
>> (explanatory variables) more plausibly free of measurement error by
>> dropping cases with extreme values can improve estimates; but dropping
>> extreme values of the outcome y will typically introduce bias even
>> where there was none before. In general, selecting on the outcome
>> variable is a terrible idea.
>>
>> If you want to restrict X, you can go variable by variable and drop
>> the top half a percent, or look at all X together as an ellipsoidal
>> cloud (i.e. using Mahalanobis distance and excluding those obs with
>> distance in the top one percent); probably the variable by variable
>> approach is better (especially when data is not multivariate normal)
>> but here is an example of both:
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

```
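As a rough illustration of the -hadimvo- suggestion at the top of this message, here is a minimal sketch on the same `nlsw88` variables used in the quoted code. The variable names `out_hadi` and `out_bacon` are hypothetical, chosen only for this example. The -bacon- call is left commented out because it is a user-written command that has to be installed from SSC first; check `help bacon` for its exact options before relying on it.

```stata
* Sketch only, not part of the original posts.
* out_hadi and out_bacon are hypothetical variable names.
clear all
sysuse nlsw88

* -hadimvo- is an official Stata command; generate() creates a 0/1 flag
* and p() sets the significance level of the outlier cutoff.
hadimvo wage hours, generate(out_hadi) p(0.05)
tabulate out_hadi

* -bacon- is user-written: install it once from SSC, then uncomment.
* ssc install bacon
* bacon wage hours, generate(out_bacon)
* tabulate out_hadi out_bacon

* Compare estimates with and without the flagged observations.
logit married wage hours
logit married wage hours if out_hadi == 0
```

Cross-tabulating `out_hadi` against `ex` and `ex2` from the quoted code would show how far the flagged sets overlap. The caution in the quoted message about dropping observations applies to these commands as well.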