



Re: st: Extreme data points


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Extreme data points
Date   Wed, 8 Jun 2011 06:47:35 -0400

Achmed Aldai <Hauptseminar@gmx.de>:
See
http://www.stata.com/statalist/archive/2011-06/msg00240.html
and the rest of that thread; or read on for an improved multivariate
exclusion algorithm (the prior iteration ignored possible correlations
in X).

The advisability of dropping extreme data points depends on what is
meant and why this is wanted; restricting a regression to a range of X
(explanatory variables) more plausibly free of measurement error by
dropping cases with extreme values can improve estimates; but dropping
extreme values of the outcome y will typically introduce bias even
where there was none before. In general, selecting on the outcome
variable is a terrible idea.
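
A minimal simulated sketch of that point, not taken from the original
thread: the data, the variable names (x, y, ex_x, ex_y), and the 5%/95%
cutoffs are all assumptions chosen for illustration. With a true slope
of 1, trimming on x should leave the slope roughly unchanged, while
trimming on y should pull it toward zero.

```stata
* Hypothetical simulation: y = x + noise, so the true slope is 1
clear
set seed 12345
set obs 10000
generate x = rnormal()
generate y = x + rnormal()

* Flag (rather than drop) the top and bottom 5% of each variable
_pctile x, percentiles(5 95)
generate byte ex_x = (x < r(r1)) | (x > r(r2))
_pctile y, percentiles(5 95)
generate byte ex_y = (y < r(r1)) | (y > r(r2))

regress y x                  // full sample
regress y x if ex_x == 0     // selecting on x: slope expected to stay near 1
regress y x if ex_y == 0     // selecting on y: slope expected to shrink toward 0
```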

If you want to restrict X, you can go variable by variable and drop
the top and bottom half percent of each, or look at all X together as
an ellipsoidal cloud (i.e. using Mahalanobis distance and excluding
those obs with distance in the top one percent); probably the variable
by variable approach is better (especially when the data are not
multivariate normal), but here is an example of both:

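clear all
sysuse nlsw88
loc X wage hours
* Univariate: flag values below the 0.5th or above the 99.5th percentile
foreach v of loc X {
    _pctile `v', nq(200)
    g byte lo_`v'=(`v'<r(r1))|(`v'>r(r199))
}
* Combine the per-variable flags into a single 0/1 exclusion dummy
egen ex=rowtotal(lo_*)
replace ex=1 if ex>1
la var ex "Excluded values of X (univariate)"
* Multivariate: standardize the first variable, then replace each later
* variable by its standardized residual on the earlier ones, so that the
* sum of squares computed below is a Mahalanobis-type distance
loc i 1
qui foreach v of loc X {
    if `i'==1 {
        su `v'
        g t_`v'=(`v'-r(mean))/r(sd)
        loc 1 t_`v'
    }
    else {
        reg `v' `1'
        predict t_`v', res
        su t_`v'
        replace t_`v'=(t_`v'-r(mean))/r(sd)
        * Add this residual to the regressor list (matters only with 3+ variables)
        loc 1 `1' t_`v'
    }
    loc i=`i'+1
}
qui foreach v of loc X {
    replace t_`v'=(t_`v')^2
}
* Squared distance from the center; flag the top one percent
egen dm=rowtotal(t_*)
_pctile dm, nq(100)
g byte ex2=(dm>r(r99))
la var ex2 "Excluded values of X (multivariate)"
* Use the dummies to restrict estimation samples and graphs
logit married wage hours if ex<1
logit married wage hours if ex2<1
sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)")) name(u)
sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)"))
* Compare the two exclusion rules
ta ex ex2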

*Note the general approach of never modifying the original data by
dropping observations or replacing values with missing, but rather
making an exclusion dummy that can be used in graphs or regressions
downstream.
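
One caveat when building such a dummy, sketched here as a variation
rather than as part of the original example: Stata treats missing
values as larger than any number, so a comparison like x > r(r199) is
true wherever x is missing. The names lo2_* and ex_any below are
hypothetical; the guard leaves the flag missing, instead of 1, wherever
the variable itself is missing.

```stata
sysuse nlsw88, clear
foreach v of varlist wage hours {
    _pctile `v', nq(200)
    * Flag is 0/1 where the variable is observed, missing otherwise
    generate byte lo2_`v' = (`v' < r(r1)) | (`v' > r(r199)) if !missing(`v')
}
* 1 if any observed variable is extreme; rowmax() ignores missing flags
egen byte ex_any = rowmax(lo2_*)
logit married wage hours if ex_any == 0
```

The original data stay untouched; only the if qualifier changes which
observations enter the estimate.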

On Tue, Jun 7, 2011 at 6:56 AM, Achmed Aldai <Hauptseminar@gmx.de> wrote:
> how can I drop the top 0,5% and bottom 0,5% observations of a variable?
> I have to do this for several variables.
> It would be great if someone could give me an example.