Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Austin Nichols <[email protected]>
To: [email protected]
Subject: Re: st: Extreme data points
Date: Wed, 8 Jun 2011 06:47:35 -0400
Achmed Aldai <[email protected]>:
See http://www.stata.com/statalist/archive/2011-06/msg00240.html and the rest of that thread; or read on for an improved multivariate exclusion algorithm (the prior iteration ignored possible correlations in X).

The advisability of dropping extreme data points depends on what is meant and why it is wanted. Restricting a regression to a range of X (explanatory variables) more plausibly free of measurement error, by dropping cases with extreme values, can improve estimates; but dropping extreme values of the outcome y will typically introduce bias even where there was none before. In general, selecting on the outcome variable is a terrible idea.

If you want to restrict X, you can go variable by variable and drop the top and bottom half a percent, or look at all X together as an ellipsoidal cloud (i.e. using Mahalanobis distance and excluding those obs with distance in the top one percent); probably the variable-by-variable approach is better (especially when the data are not multivariate normal), but here is an example of both:

clear all
sysuse nlsw88
loc X wage hours
* Univariate: flag values below the 0.5th or above the 99.5th percentile
foreach v of loc X {
 _pctile `v', nq(200)
 g byte lo_`v'=(`v'<r(r1))|(`v'>r(r199))
}
egen ex=rowtotal(lo_*)
replace ex=1 if ex>1
la var ex "Excluded values of X (univariate)"
* Multivariate: standardize the first variable, then standardize the
* residual of each later variable from a regression on the first
loc i 1
qui foreach v of loc X {
 if `i'==1 {
  su `v'
  g t_`v'=(`v'-r(mean))/r(sd)
  loc 1 t_`v'
 }
 else {
  reg `v' `1'
  predict t_`v', res
  su t_`v'
  replace t_`v'=(t_`v'-r(mean))/r(sd)
 }
 loc i=`i'+1
}
qui foreach v of loc X {
 replace t_`v'=(t_`v')^2
}
* Sum of the squared standardized terms, i.e. a squared distance
egen dm=rowtotal(t_*)
_pctile dm, nq(100)
g byte ex2=(dm>r(r99))
la var ex2 "Excluded values of X (multivariate)"
logit married wage hours if ex<1
logit married wage hours if ex2<1
sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)")) name(u)
sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)"))
ta ex ex2

Note the general approach of never modifying the original data by dropping observations or replacing values with missing, but rather making an exclusion dummy that can be used in graphs or regressions downstream.

On Tue, Jun 7, 2011 at 6:56 AM, Achmed Aldai <[email protected]> wrote:
> how can I drop the top 0,5% and bottom 0,5% observations of a variable?
> I have to do this for several variables.
> It would be great if someone could give me an example.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
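
As a side note on the multivariate step in the message above: for the two-variable case (wage and hours), the sum of squared standardized terms stored in dm appears to coincide with the squared Mahalanobis distance. The short derivation below is a sketch of why, not part of the original post. The symbols z_1, u_2, e, r, s_j, and S are notation introduced here for illustration, and the argument assumes both variables are observed for every case used and that all moments are computed on that same sample.

```latex
% Notation assumed for this sketch (not from the original post):
%   x_1 = wage, x_2 = hours; \bar{x}_j and s_j are the sample mean and sd;
%   r is the sample correlation; S is the sample covariance matrix;
%   e is the residual from regressing x_2 on x_1 (or on standardized x_1).
\begin{align*}
z_1 &= \frac{x_1-\bar{x}_1}{s_1}, \qquad
u_2 = \frac{x_2-\bar{x}_2}{s_2}, \\
e &= s_2\,(u_2 - r\,z_1), \qquad
\operatorname{sd}(e) = s_2\sqrt{1-r^2}, \\
z_2 &= \frac{e}{\operatorname{sd}(e)} = \frac{u_2 - r\,z_1}{\sqrt{1-r^2}}, \\
z_1^2 + z_2^2
  &= \frac{z_1^2 - 2\,r\,z_1 u_2 + u_2^2}{1-r^2}
   = (x-\bar{x})'\,S^{-1}\,(x-\bar{x}).
\end{align*}
```

Because dm is the squared distance and squaring is monotone for nonnegative values, ranking on dm and taking the top one percent flags the same observations as ranking on the distance itself.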

