Re: st: Extreme data points

From: Austin Nichols <[email protected]>
To: [email protected]
Subject: Re: st: Extreme data points
Date: Wed, 8 Jun 2011 10:23:17 -0400

Achmed Aldai <[email protected]>:
While that multivariate code works in the example with 2 variables, it
fails when more than 2 variables are put in X; the version below handles
any number of variables (bearing in mind that I cannot recommend the
multivariate outlier detection approach for use in any real data, given
the simulation evidence presented already):

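clear all
sysuse nlsw88
loc X wage hours

* Univariate rule: flag obs below the 0.5th or above the 99.5th percentile
* of each variable (missing values compare as large, so they are flagged too)
foreach v of loc X {
    _pctile `v', nq(200)
    g byte lo_`v'=(`v'<r(r1)|`v'>r(r199))
}
egen ex=rowtotal(lo_*)
replace ex=1 if ex>1
la var ex "Excluded values of X (univariate)"

* Multivariate rule: standardize the first variable, then replace each later
* variable by its standardized residual from a regression on those before it,
* so the t_ variables are mutually uncorrelated, each with unit variance
loc i 1
loc 1
qui foreach v of loc X {
    if `i'==1 {
        su `v'
        g double t_`v'=(`v'-r(mean))/r(sd)
    }
    else {
        reg `v' `1'
        predict double t_`v', res
        su t_`v'
        replace t_`v'=(t_`v'-r(mean))/r(sd)
    }
    loc 1 `1' t_`v'
    loc i=`i'+1
}

* Squared distance is the sum of the squared t_ variables
* (note that rowtotal() counts a missing value as 0)
qui foreach v of loc X {
    replace t_`v'=(t_`v')^2
}
egen double dm=rowtotal(t_*)

* Flag the top one percent of distances
_pctile dm, nq(100)
g byte ex2=(dm>r(r99))
la var ex2 "Excluded values of X (multivariate)"

* Compare estimates, plots, and the overlap of the two rules
logit married wage hours if ex<1
logit married wage hours if ex2<1
sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)")) name(u)
sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)"))
ta ex ex2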

* But see also -mahapick- on SSC for a canned solution for calculating Mahalanobis distance.
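
As a cross-check on the code above, the sum of squared sequential residuals should match the textbook squared Mahalanobis distance (x-m)'S^-1(x-m). The following is only a sketch for the two-variable case, restricted to complete cases so that both routes use the same sample; the variable names z1, z2, d_seq, dw, dh and d_dir are made up for illustration.

```stata
sysuse nlsw88, clear
keep if !missing(wage, hours)      // one common sample for both routes

* Route 1: sequential standardized residuals, as in the code above
qui su wage
g double z1 = (wage - r(mean))/r(sd)
qui reg hours z1
predict double z2, res
qui su z2
qui replace z2 = (z2 - r(mean))/r(sd)
g double d_seq = z1^2 + z2^2

* Route 2: quadratic form in the inverse sample covariance matrix
qui corr wage hours, cov
mat S = r(C)
mat Si = syminv(S)
qui su wage
g double dw = wage - r(mean)
qui su hours
g double dh = hours - r(mean)
g double d_dir = Si[1,1]*dw^2 + 2*Si[1,2]*dw*dh + Si[2,2]*dh^2

* The two should agree up to rounding error
assert reldif(d_seq, d_dir) < 1e-8
```

The agreement is exact only when every step uses the same observations; in the full code above, wage is standardized on all observations while the regression uses only those with nonmissing hours, so the two can differ slightly there.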

On Wed, Jun 8, 2011 at 6:47 AM, Austin Nichols <[email protected]> wrote:
> Achmed Aldai <[email protected]>:
> See
> http://www.stata.com/statalist/archive/2011-06/msg00240.html
> and the rest of that thread; or read on for an improved multivariate
> exclusion algorithm (the prior iteration ignored possible correlations
> in X).
>
> The advisability of dropping extreme data points depends on what is
> meant and why this is wanted; restricting a regression to a range of X
> (explanatory variables) more plausibly free of measurement error by
> dropping cases with extreme values can improve estimates; but dropping
> extreme values of the outcome y will typically introduce bias even
> where there was none before. In general, selecting on the outcome
> variable is a terrible idea.
>
> If you want to restrict X, you can go variable by variable and drop
> the top half a percent, or look at all X together as an ellipsoidal
> cloud (i.e. using Mahalanobis distance and excluding those obs with
> distance in the top one percent); probably the variable by variable
> approach is better (especially when data is not multivariate normal)
> but here is an example of both:
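
The point made earlier in the quoted message about selecting on the outcome can be illustrated with a small simulation. This is a sketch on made-up data (it is not the example from the quoted message): y is generated with a true slope of 1 on x, and the regression is then run on the full sample, on a sample trimmed on x, and on a sample trimmed on y. The 5 percent trimming in each tail, the seed, and the names keepx, keepy, b_all, b_x and b_y are arbitrary choices for illustration.

```stata
clear
set seed 12345
set obs 10000
g double x = rnormal()
g double y = x + rnormal()         // true slope on x is 1

* Keep the middle 90 percent of x, and separately of y
_pctile x, p(5 95)
g byte keepx = inrange(x, r(r1), r(r2))
_pctile y, p(5 95)
g byte keepy = inrange(y, r(r1), r(r2))

qui reg y x
scalar b_all = _b[x]
qui reg y x if keepx               // selection on X
scalar b_x = _b[x]
qui reg y x if keepy               // selection on the outcome
scalar b_y = _b[x]

di "all: " b_all "   trimmed on x: " b_x "   trimmed on y: " b_y
```

Trimming on x leaves the slope close to 1 (at some cost in precision), while trimming on y pulls the estimate toward zero even though the full-sample regression is unbiased.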