

From: Austin Nichols <austinnichols@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Extreme data points
Date: Wed, 8 Jun 2011 10:23:17 -0400

Achmed Aldai <Hauptseminar@gmx.de>:
While that multivariate code works in the example with 2 vars, it
falls down when more than 2 variables are put in X; this is better
(bearing in mind that I cannot recommend the multivariate outlier
detection approach for use in any real data, given the simulation
evidence presented already):

clear all
sysuse nlsw88
loc X wage hours
* univariate: flag obs below the 0.5th or above the 99.5th percentile of each X
foreach v of loc X {
 _pctile `v', nq(200)
 g byte lo_`v'=(`v'<r(r1)|`v'>r(r199))
}
egen ex=rowtotal(lo_*)
replace ex=1 if ex>1
la var ex "Excluded values of X (univariate)"
* multivariate: standardize the first X, then regress each later X on the
* earlier standardized terms and standardize the residuals
loc i 1
loc 1
qui foreach v of loc X {
 if `i'==1 {
  su `v'
  g double t_`v'=(`v'-r(mean))/r(sd)
 }
 else {
  reg `v' `1'
  predict double t_`v', res
  su t_`v'
  replace t_`v'=(t_`v'-r(mean))/r(sd)
 }
 loc 1 `1' t_`v'
 loc i=`i'+1
}
* sum of squared standardized terms, then flag the top one percent
qui foreach v of loc X {
 replace t_`v'=(t_`v')^2
}
egen double dm=rowtotal(t_*)
_pctile dm, nq(100)
g byte ex2=(dm>r(r99))
la var ex2 "Excluded values of X (multivariate)"
logit married wage hours if ex<1
logit married wage hours if ex2<1
sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)")) name(u)
sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)"))
ta ex ex2
*but see also mahapick on SSC for a canned solution to calculating Mah. distance

On Wed, Jun 8, 2011 at 6:47 AM, Austin Nichols <austinnichols@gmail.com> wrote:
> Achmed Aldai <Hauptseminar@gmx.de>:
> See
> http://www.stata.com/statalist/archive/2011-06/msg00240.html
> and the rest of that thread; or read on for an improved multivariate
> exclusion algorithm (the prior iteration ignored possible correlations
> in X).
>
> The advisability of dropping extreme data points depends on what is
> meant and why this is wanted; restricting a regression to a range of X
> (explanatory variables) more plausibly free of measurement error by
> dropping cases with extreme values can improve estimates; but dropping
> extreme values of the outcome y will typically introduce bias even
> where there was none before.  In general, selecting on the outcome
> variable is a terrible idea.
>
> If you want to restrict X, you can go variable by variable and drop
> the top half a percent, or look at all X together as an ellipsoidal
> cloud (i.e. using Mahalanobis distance and excluding those obs with
> distance in the top one percent); probably the variable by variable
> approach is better (especially when data is not multivariate normal)
> but here is an example of both:

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
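
For readers wondering why the loop over X in the message above produces a Mahalanobis distance, here is a sketch of the algebra. It assumes every variable in X is observed on the same sample (no missing values). The symbols x_i, S, L and A are notation introduced here for illustration and do not appear in the original post. Each t_ variable is the corresponding X minus a linear combination of the earlier X variables, divided by a positive standard deviation, so the whole loop amounts to one lower-triangular transformation of the centered data; the regression step makes the t_ variables mutually uncorrelated, and the standardization gives each unit variance.

```latex
% Notation (assumed for this sketch, not from the post):
% x_i is the k-vector of X for observation i, \bar{x} its sample mean,
% S the sample covariance matrix of X, and S = L L^\top its Cholesky factorization.
\begin{align*}
t_i &= A\,(x_i - \bar{x}), \qquad A \text{ lower triangular with positive diagonal},\\
\widehat{\operatorname{Cov}}(t) &= A S A^{\top} = I_k
  \;\Longrightarrow\; S = A^{-1}A^{-\top}
  \;\Longrightarrow\; A^{-1} = L,\\
\mathtt{dm}_i = \sum_{j=1}^{k} t_{ij}^{2}
  &= (x_i-\bar{x})^{\top} A^{\top}A\,(x_i-\bar{x})
   = (x_i-\bar{x})^{\top} S^{-1} (x_i-\bar{x}) = D_i^{2}.
\end{align*}
```

So dm is the squared Mahalanobis distance; because squaring is monotone for nonnegative distances, the top one percent of dm flags the same observations as the top one percent of the distance itself. When some X have missing values, the samples used at each step differ and rowtotal() treats a missing t_ term as zero, so the identity should be read as approximate in that case.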

**Follow-Ups**:
- **Re: st: Extreme data points** *From:* Jorge Eduardo Pérez Pérez <perez.jorge@ur.edu.co>

**References**:
- **st: Extreme data points** *From:* "Achmed Aldai" <Hauptseminar@gmx.de>
- **Re: st: Extreme data points** *From:* Austin Nichols <austinnichols@gmail.com>
