Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org**, is already up and running.


From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Elimination of outliers
Date: Mon, 6 Jun 2011 21:24:59 +0100

I don't think what happens in contrived simulations hits the main methodological issue at all. As a geographer, some of the time, an outlier to me is something like the Amazon, which is big and different and needs to be accommodated in the model. That can be done in many ways other than by discarding outliers. Once throwing away awkward data is regarded as legitimate, when do you stop? (Independent evidence that an outlier is untrustworthy, as in lab records of experiments, is a different thing, although even there there are well-known stories of discarding as a matter of prior prejudice.)

To make the question as stark as possible, and to suppress large areas of grey (gray): there are people who fit the data to the model and people who fit models to the data. It may sound like the same thing, but the attitude that you are so confident the model is right that you are happy to discard the most inconvenient data is not at all the same as the attitude that the data can tell you something about the inadequacies of the current model.

Nick

On Mon, Jun 6, 2011 at 5:57 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Nick et al. --
> Here is a simulation example demonstrating my claim.
>
> clear all
> prog simedetect, rclass
> syntax [, n(int 1000) p(int 50) y me(int 1)]
> drawnorm x1 x2 x3 e u1 u2 u3, n(`n') clear
> g p=uniform()
> g y=x1+x2+x3+e
> replace x1=x1+`me'*u1 if p<(`p'/100)
> replace x2=x2+`me'*u2 if p<(`p'/100)
> replace x3=x3+`me'*u3 if p<(`p'/100)
> reg y x1 x2 x3, r
> foreach v in x1 x2 x3 {
>  ret scalar `v'=_b[`v']
>  ret scalar s`v'=_se[`v']
>  _pctile `v', nq(100)
>  g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
> }
> foreach v of loc y {
>  _pctile `v', nq(100)
>  g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
> }
> egen ux=rowtotal(lo_*)
> reg y x1 x2 x3 if ux<1, r
> foreach v in x1 x2 x3 {
>  ret scalar u`v'=_b[`v']
>  ret scalar su`v'=_se[`v']
>  qui su `v'
>  g t_`v'=((`v'-r(mean))/r(sd))^2
> }
> foreach v of loc y {
>  qui su `v'
>  g byte t_`v'=((`v'-r(mean))/r(sd))^2
> }
> egen d=rowtotal(t_*)
> _pctile d, nq(100)
> g byte mx=(d>r(r96))
> reg y x1 x2 x3 if mx<1, r
> foreach v in x1 x2 x3 {
>  ret scalar m`v'=_b[`v']
>  ret scalar sm`v'=_se[`v']
> }
> eret clear
> end
>
> * draw 1000 datasets and try 2 trimming rules to drop extreme X
> simul, r(1000) seed(1) nodots: simedetect
> su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
> su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
> loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
> foreach v in x1 x2 x3 {
>  tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(t`v')
>  g mse_`v'=(`v'-1)^2
>  g mse_u`v'=(u`v'-1)^2
>  g mse_m`v'=(m`v'-1)^2
> }
> su mse*, sep(3)
> * univariate trim dominates
>
> * now try trimming based on outcome variable too
> simul, r(1000) seed(1) nodots: simedetect, y
> su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
> su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
> loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
> foreach v in x1 x2 x3 {
>  tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(y`v')
>  g mse_`v'=(`v'-1)^2
>  g mse_u`v'=(u`v'-1)^2
>  g mse_m`v'=(m`v'-1)^2
> }
> su mse*, sep(3)
> * now no trimming is clearly better; the trimming introduces bias
>
> * last, try trimming based on outcome variable too w/o meas error
> simul, r(1000) seed(1) nodots: simedetect, y me(0)
> su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
> su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
> loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
> foreach v in x1 x2 x3 {
>  tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(z`v')
>  g mse_`v'=(`v'-1)^2
>  g mse_u`v'=(u`v'-1)^2
>  g mse_m`v'=(m`v'-1)^2
> }
> su mse*, sep(3)
> * no trimming is clearly better; the trimming introduces bias
>
>> On Mon, Jun 6, 2011 at 10:45 AM, Austin Nichols <austinnichols@gmail.com> wrote:
>>> Nick--
>>> I think the advisability of trimming outliers depends on what is
>>> meant; restricting a regression to a range of X (explanatory
>>> variables) more plausibly free of measurement error by dropping cases
>>> with extreme values can improve estimates, both by reducing bias due
>>> to measurement error and providing much more accurate SEs; but doing
>>> the same to the outcome y will typically introduce bias even where
>>> there was none before--in general, selecting on the outcome variable is
>>> demonstrably a terrible idea.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
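[Archive note: the sketch below is not part of the original thread. It restates the pattern in Austin's Stata simulation in Python/NumPy, using one large sample instead of 1,000 replications; the 50% contamination rate and the 2nd/98th-percentile trim mirror the defaults (`p(50)`, `me(1)`) in the program above. Trimming extreme values of a mismeasured x moves the slope toward the truth, while the same trim applied to the outcome y pushes it further away.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def ols_slope(x, y):
    """Slope from regressing y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

# True model: y = x* + e with slope 1, but we observe x = x* + u,
# where measurement error u hits a random 50% of observations.
xstar = rng.normal(size=n)
contaminated = rng.random(n) < 0.5
x = xstar + np.where(contaminated, rng.normal(size=n), 0.0)
y = xstar + rng.normal(size=n)

naive = ols_slope(x, y)  # attenuated well below the true slope of 1

# Univariate trim on x at the 2nd/98th percentiles: extreme observed
# x values are disproportionately measurement-error cases.
lo, hi = np.percentile(x, [2, 98])
keep = (x > lo) & (x < hi)
trim_x = ols_slope(x[keep], y[keep])  # closer to 1 than naive

# The same trim applied to the outcome y instead: selecting on the
# outcome attenuates the slope even further.
lo, hi = np.percentile(y, [2, 98])
keep = (y > lo) & (y < hi)
trim_y = ols_slope(x[keep], y[keep])  # worse than naive

print(f"naive={naive:.3f}  trim_x={trim_x:.3f}  trim_y={trim_y:.3f}")
```

With these parameters the untrimmed slope sits near 1/1.5 ≈ 0.67 (classical attenuation: the contamination adds 0.5 to var(x) but nothing to cov(x, y)); trimming on x recovers part of the gap, and trimming on y loses more.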

**Follow-Ups**:

- **Re: st: Elimination of outliers** *From:* Ronan Conroy <rconroy@rcsi.ie>
- **Re: st: Elimination of outliers** *From:* Austin Nichols <austinnichols@gmail.com>

**References**:

- **st: Elimination of outliers** *From:* "Achmed Aldai" <Hauptseminar@gmx.de>
- **Re: st: Elimination of outliers** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Elimination of outliers** *From:* "Achmed Aldai" <Hauptseminar@gmx.de>
- **RE: st: Elimination of outliers** *From:* Nick Cox <n.j.cox@durham.ac.uk>
- **Re: st: Elimination of outliers** *From:* Austin Nichols <austinnichols@gmail.com>
- **Re: st: Elimination of outliers** *From:* Austin Nichols <austinnichols@gmail.com>
- **Re: st: Elimination of outliers** *From:* Austin Nichols <austinnichols@gmail.com>
