Notice: On April 23, 2014, Statalist moved from an email list to a forum.

Re: st: Extreme data points

From   Austin Nichols <>
Subject   Re: st: Extreme data points
Date   Wed, 8 Jun 2011 10:23:17 -0400

Achmed Aldai <>:
While that multivariate code works in the example with 2 vars, it
falls down when more than 2 variables are put in X; this is better
(bearing in mind that I cannot recommend the multivariate outlier
detection approach for use in any real data, given the simulation
evidence presented already):

clear all
sysuse nlsw88
loc X wage hours
* univariate: flag values below the 0.5th or above the 99.5th percentile
foreach v of loc X {
 _pctile `v', nq(200)
 g byte lo_`v'=(`v'<r(r1)|`v'>r(r199))
}
egen ex=rowtotal(lo_*)
replace ex=1 if ex>1
la var ex "Excluded values of X (univariate)"
* multivariate: standardize the first variable, then residualize each
* later variable on the earlier t_* and standardize the residual
loc i 1
loc 1
qui foreach v of loc X {
 if `i'==1 {
  su `v'
  g double t_`v'=(`v'-r(mean))/r(sd)
 }
 else {
  reg `v' `1'
  predict double t_`v', res
  su t_`v'
  replace t_`v'=(t_`v'-r(mean))/r(sd)
 }
 loc 1 `1' t_`v'
 loc i=`i'+1
}
* sum of squared orthogonalized scores = squared distance
qui foreach v of loc X {
 replace t_`v'=(t_`v')^2
}
egen double dm=rowtotal(t_*)
_pctile dm, nq(100)
g byte ex2=(dm>r(r99))
la var ex2 "Excluded values of X (multivariate)"
logit married wage hours if ex<1
logit married wage hours if ex2<1
sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)")) name(u)
sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)"))
ta ex ex2

* But see also mahapick on SSC for a canned solution to calculating Mahalanobis distance.
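
As a rough cross-check on dm, the squared Mahalanobis distance can also be computed directly from the sample covariance matrix in Mata. This is only a sketch under my own assumptions, not part of the code above: the names touse and md2 and the Mata objects X, Z, S, d are made up for illustration, and it uses complete cases only. It should track dm closely on those cases but need not match exactly, since the loop above standardizes the first variable over all its nonmissing observations and rowtotal() treats a missing t_* as zero. Run it from a do-file.

```stata
sysuse nlsw88, clear
loc X wage hours
* touse==1 only where every variable in X is nonmissing
mark touse
markout touse `X'
g double md2=.
* In Mata: center the columns, take the sample covariance S,
* and form d = rowsum((Z*inv(S)) :* Z), the squared distance
mata:
X = st_data(., tokens(st_local("X")), "touse")
Z = X :- mean(X)
S = variance(X)
d = rowsum((Z*invsym(S)) :* Z)
st_store(., "md2", "touse", d)
end
su md2
```

One sanity check: with p variables and n complete cases, the mean of md2 should come out at p(n-1)/n, so just under 2 here.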

On Wed, Jun 8, 2011 at 6:47 AM, Austin Nichols <> wrote:
> Achmed Aldai <>:
> See
> and the rest of that thread; or read on for an improved multivariate
> exclusion algorithm (the prior iteration ignored possible correlations
> in X).
> The advisability of dropping extreme data points depends on what is
> meant and why this is wanted; restricting a regression to a range of X
> (explanatory variables) more plausibly free of measurement error by
> dropping cases with extreme values can improve estimates; but dropping
> extreme values of the outcome y will typically introduce bias even
> where there was none before. In general, selecting on the outcome
> variable is a terrible idea.
> If you want to restrict X, you can go variable by variable and drop
> the top half a percent, or look at all X together as an ellipsoidal
> cloud (i.e. using Mahalanobis distance and excluding those obs with
> distance in the top one percent); probably the variable by variable
> approach is better (especially when data is not multivariate normal)
> but here is an example of both: