Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Extreme data points

From	Austin Nichols <[email protected]>
To	[email protected]
Subject	Re: st: Extreme data points
Date	Wed, 8 Jun 2011 06:47:35 -0400

Achmed Aldai <[email protected]>:
See
http://www.stata.com/statalist/archive/2011-06/msg00240.html
and the rest of that thread; or read on for an improved multivariate
exclusion algorithm (the prior iteration ignored possible correlations
in X).

Whether to drop extreme data points depends on what counts as extreme
and why you want to drop them. Restricting a regression to a range of X
(explanatory variables) more plausibly free of measurement error, by
dropping cases with extreme values, can improve estimates. Dropping
extreme values of the outcome y, however, will typically introduce bias
even where there was none before. In general, selecting on the outcome
variable is a terrible idea.
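
To make the asymmetry concrete, here is a minimal simulated sketch. The
variable names, seed, sample size, and 5% cutoffs are all arbitrary
assumptions chosen to make the effect visible. The true slope of y on x
is 1; trimming on x should leave the estimate near 1, while trimming on
y should pull it toward zero.

```stata
* Simulated data with a known slope of 1 (illustrative assumption)
clear
set seed 12345
set obs 10000
gen x = rnormal()
gen y = x + rnormal()
* Flag the bottom and top 5% of x, and separately of y, without dropping anything
foreach v in x y {
    _pctile `v', p(5 95)
    gen byte ex_`v' = (`v' < r(r1)) | (`v' > r(r2))
}
regress y x                  // full sample
scalar b_full = _b[x]
regress y x if !ex_x         // trimmed on the regressor
scalar b_trimx = _b[x]
regress y x if !ex_y         // trimmed on the outcome
scalar b_trimy = _b[x]
di "full: " b_full "  trimmed on x: " b_trimx "  trimmed on y: " b_trimy
```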

If you want to restrict X, you can go variable by variable and drop
the top and bottom half percent of each, or look at all X together as
an ellipsoidal cloud (i.e., using Mahalanobis distance and excluding
those obs with distance in the top one percent). The variable-by-variable
approach is probably better (especially when the data are not
multivariate normal), but here is an example of both:

clear all
sysuse nlsw88
loc X wage hours
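* nq(200) gives cut points every 0.5%: r(r1) is the 0.5th percentile, r(r199) the 99.5th.
* Missing values compare as larger than any number, so they are flagged too.
foreach v of loc X {
_pctile `v', nq(200)
g byte lo_`v'=(`v'<r(r1))|(`v'>r(r199))
}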
egen ex=rowtotal(lo_*)
replace ex=1 if ex>1
la var ex "Excluded values of X (univariate)"
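* Orthogonalize X sequentially: standardize the first variable, then replace
* each later variable by its standardized residual on all earlier components
* (prev accumulates them, so this also works with more than two X)
loc prev
qui foreach v of loc X {
if "`prev'"=="" {
su `v'
g t_`v'=(`v'-r(mean))/r(sd)
}
else {
reg `v' `prev'
predict t_`v', res
su t_`v'
replace t_`v'=(t_`v'-r(mean))/r(sd)
}
loc prev `prev' t_`v'
}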
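* Square the standardized components and sum them into a squared distance dm
qui foreach v of loc X {
replace t_`v'=(t_`v')^2
}
* rowtotal() treats missing as 0, so obs missing some X get a partial sum
egen dm=rowtotal(t_*)
_pctile dm, nq(100)
g byte ex2=(dm>r(r99))
la var ex2 "Excluded values of X (multivariate)"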
logit married wage hours if ex<1
logit married wage hours if ex2<1
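* All points are drawn first and retained points on top, so only outliers show as series 1
sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)") lab(2 "Retained")) name(u)
sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)") lab(2 "Retained")) name(m)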
ta ex ex2

*Note the general approach of never modifying the original data by
dropping observations or replacing values with missing, but rather
making an exclusion dummy that can be used in graphs or regressions
downstream.
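
For comparison, the squared Mahalanobis distance can also be computed
directly from the sample covariance matrix in Mata. This is only a
sketch under stated assumptions: the names touse, d2, and ex_m are
hypothetical, and only complete cases on wage and hours enter the
calculation, so results can differ slightly from dm above if any X has
missing values.

```stata
sysuse nlsw88, clear
* touse marks complete cases; nothing is dropped from the data
gen byte touse = !missing(wage, hours)
gen double d2 = .
mata:
X  = st_data(., ("wage", "hours"), "touse")   // complete cases only
Z  = X :- mean(X)                             // center each column
S  = variance(X)                              // sample covariance matrix
d2 = rowsum((Z * invsym(S)) :* Z)             // z_i' * inv(S) * z_i, row by row
st_store(., "d2", "touse", d2)
end
* Flag the top 1% of distances; obs with missing X stay missing rather than flagged
_pctile d2 if touse, nq(100)
gen byte ex_m = (d2 > r(r99)) if touse
la var ex_m "Excluded values of X (direct Mahalanobis)"
ta ex_m, missing
```

As with ex and ex2, ex_m is an exclusion dummy to use in an if
qualifier downstream (e.g. if ex_m==0); the data themselves are left
untouched.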

On Tue, Jun 7, 2011 at 6:56 AM, Achmed Aldai <[email protected]> wrote:
> how can I drop the top 0,5% and bottom 0,5% observations of a variable?
> I have to do this for several variables.
> It would be great if someone could give me an example.