Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Jorge Eduardo Pérez Pérez <perez.jorge@ur.edu.co>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: Extreme data points
Date: Wed, 8 Jun 2011 11:47:09 -0400

You might also want to take a look at multivariate outlier detection
methods in Stata: -hadimvo- and -bacon-
_______________________
Jorge Eduardo Pérez Pérez

On Wed, Jun 8, 2011 at 10:23 AM, Austin Nichols <austinnichols@gmail.com> wrote:
> Achmed Aldai <Hauptseminar@gmx.de>:
> While that multivariate code works in the example with 2 vars, it
> falls down when more than 2 variables are put in X; this is better
> (bearing in mind that I cannot recommend the multivariate outlier
> detection approach for use in any real data, given the simulation
> evidence presented already):
>
> clear all
> sysuse nlsw88
> loc X wage hours
> foreach v of loc X {
>  _pctile `v', nq(200)
>  g byte lo_`v'=(`v'<r(r1)|`v'>r(r199))
> }
> egen ex=rowtotal(lo_*)
> replace ex=1 if ex>1
> la var ex "Excluded values of X (univariate)"
> loc i 1
> loc 1
> qui foreach v of loc X {
>  if `i'==1 {
>   su `v'
>   g double t_`v'=(`v'-r(mean))/r(sd)
>  }
>  else {
>   reg `v' `1'
>   predict double t_`v', res
>   su t_`v'
>   replace t_`v'=(t_`v'-r(mean))/r(sd)
>  }
>  loc 1 `1' t_`v'
>  loc i=`i'+1
> }
> qui foreach v of loc X {
>  replace t_`v'=(t_`v')^2
> }
> egen double dm=rowtotal(t_*)
> _pctile dm, nq(100)
> g byte ex2=(dm>r(r99))
> la var ex2 "Excluded values of X (multivariate)"
> logit married wage hours if ex<1
> logit married wage hours if ex2<1
> sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)")) name(u)
> sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)"))
> ta ex ex2
>
> *but see also mahapick on SSC for a canned solution to calculating Mah. distance
>
> On Wed, Jun 8, 2011 at 6:47 AM, Austin Nichols <austinnichols@gmail.com> wrote:
>> Achmed Aldai <Hauptseminar@gmx.de>:
>> See
>> http://www.stata.com/statalist/archive/2011-06/msg00240.html
>> and the rest of that thread; or read on for an improved multivariate
>> exclusion algorithm (the prior iteration ignored possible correlations
>> in X).
>>
>> The advisability of dropping extreme data points depends on what is
>> meant and why this is wanted; restricting a regression to a range of X
>> (explanatory variables) more plausibly free of measurement error by
>> dropping cases with extreme values can improve estimates; but dropping
>> extreme values of the outcome y will typically introduce bias even
>> where there was none before. In general, selecting on the outcome
>> variable is a terrible idea.
>>
>> If you want to restrict X, you can go variable by variable and drop
>> the top half a percent, or look at all X together as an ellipsoidal
>> cloud (i.e. using Mahalanobis distance and excluding those obs with
>> distance in the top one percent); probably the variable by variable
>> approach is better (especially when data is not multivariate normal)
>> but here is an example of both:

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
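
For readers who want to try the -hadimvo- suggestion, here is a minimal sketch. It assumes the same nlsw88 example data and the two variables (wage, hours) used in the quoted code. The new variable names hadi_out and hadi_dist and the p(.01) cutoff are illustrative choices, not something taken from the thread.

```stata
* Sketch only: variable names and the p() value are illustrative assumptions
sysuse nlsw88, clear

* Flag multivariate outliers in (wage, hours);
* hadi_out is the 0/1 outlier flag, hadi_dist holds the Hadi distance
hadimvo wage hours, gen(hadi_out hadi_dist) p(.01)

* How many observations were flagged?
ta hadi_out

* -bacon- is user-written; it would need to be installed first, e.g.
*   ssc install bacon
```

The flag could then be used like ex2 in the quoted code, e.g. by restricting an estimation command with `if hadi_out==0`. Austin Nichols' caution about multivariate outlier exclusion in real data applies here too.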

**Follow-Ups**:
- **Re: st: Extreme data points** *From:* Austin Nichols <austinnichols@gmail.com>

**References**:
- **st: Extreme data points** *From:* "Achmed Aldai" <Hauptseminar@gmx.de>
- **Re: st: Extreme data points** *From:* Austin Nichols <austinnichols@gmail.com>
- **Re: st: Extreme data points** *From:* Austin Nichols <austinnichols@gmail.com>
