# Re: st: Extreme data points

| From | To | Subject | Date |
|---|---|---|---|
| Austin Nichols <[email protected]> | [email protected] | Re: st: Extreme data points | Wed, 8 Jun 2011 06:47:35 -0400 |

```
Achmed Aldai <[email protected]>:
See
http://www.stata.com/statalist/archive/2011-06/msg00240.html
and the rest of that thread; or read on for an improved multivariate
exclusion algorithm (the prior iteration ignored possible correlations
in X).

The advisability of dropping extreme data points depends on what is
meant and why this is wanted; restricting a regression to a range of X
(explanatory variables) more plausibly free of measurement error by
dropping cases with extreme values can improve estimates; but dropping
extreme values of the outcome y will typically introduce bias even
where there was none before. In general, selecting on the outcome
variable is a terrible idea.
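
* A minimal simulated sketch of that last point (not part of the original
* post; the seed, sample size, and 5% trimming are arbitrary assumptions).
* The true slope of y on x is 1. Trimming on x should leave the estimate
* close to 1, while trimming on y should pull it toward zero.
clear
set seed 12345
set obs 10000
g x=rnormal()
g y=x+rnormal()
* full sample
reg y x
* trim the top and bottom 5% of x
_pctile x, nq(20)
loc xlo=r(r1)
loc xhi=r(r19)
reg y x if x>`xlo' & x<`xhi'
* trim the top and bottom 5% of y (selection on the outcome)
_pctile y, nq(20)
loc ylo=r(r1)
loc yhi=r(r19)
reg y x if y>`ylo' & y<`yhi'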

If you want to restrict X, you can go variable by variable and drop
the top and bottom half a percent of each, or look at all X together as
an ellipsoidal cloud (i.e. using Mahalanobis distance and excluding those
obs with distance in the top one percent); probably the variable by
variable approach is better (especially when the data are not
multivariate normal) but here is an example of both:

clear all
sysuse nlsw88
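loc X wage hours
* Univariate rule: flag values below the 0.5th or above the 99.5th percentile
foreach v of loc X {
  _pctile `v', nq(200)
  * missing values compare as larger than any number, so they are flagged too
  g byte lo_`v'=(`v'<r(r1))|(`v'>r(r199))
}
* ex=1 if any variable in X is flagged
egen ex=rowtotal(lo_*)
replace ex=1 if ex>1
la var ex "Excluded values of X (univariate)"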
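* Multivariate rule: standardize the first variable, then replace each later
* variable with its standardized residual from a regression on the earlier ones
loc i 1
qui foreach v of loc X {
  if `i'==1 {
    su `v'
    g t_`v'=(`v'-r(mean))/r(sd)
    loc 1 t_`v'
  }
  else {
    reg `v' `1'
    predict t_`v', res
    su t_`v'
    replace t_`v'=(t_`v'-r(mean))/r(sd)
    * add this residual to the regressor list so that a third or later
    * variable is orthogonalized against all earlier ones (no effect with two)
    loc 1 `1' t_`v'
  }
  loc i=`i'+1
}
* The sum of squared orthogonalized z-scores gives (up to missing-data
* handling) the squared Mahalanobis distance
qui foreach v of loc X {
  replace t_`v'=(t_`v')^2
}
* caution: rowtotal() counts a missing component as zero
egen dm=rowtotal(t_*)
_pctile dm, nq(100)
g byte ex2=(dm>r(r99))
la var ex2 "Excluded values of X (multivariate)"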
logit married wage hours if ex<1
logit married wage hours if ex2<1
sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)")) name(u)
sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)"))
ta ex ex2
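* A hedged cross-check, not part of the original post: compute the squared
* Mahalanobis distance directly from the inverse covariance matrix of
* wage and hours and compare it with dm. The names V, Vi, dw, dh, and d2
* are illustrative. Only complete cases get a value of d2, so small
* differences from dm are to be expected because hours has missing values.
qui corr wage hours, cov
mat V=r(C)
mat Vi=invsym(V)
qui su wage if !mi(wage,hours)
g double dw=wage-r(mean)
qui su hours if !mi(wage,hours)
g double dh=hours-r(mean)
* quadratic form (x-m)' Vi (x-m) written out for two variables
g double d2=Vi[1,1]*dw^2+2*Vi[1,2]*dw*dh+Vi[2,2]*dh^2
corr dm d2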

*Note the general approach of never modifying the original data by
dropping observations or replacing values with missing, but rather
making an exclusion dummy that can be used in graphs or regressions
downstream.

On Tue, Jun 7, 2011 at 6:56 AM, Achmed Aldai <[email protected]> wrote:
> how can I drop the top 0,5% and bottom 0,5% observations of a variable?
> I have to do this for several variables.
> It would be great if someone could give me an example.
```