
Re: st: Elimination of outliers


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Elimination of outliers
Date   Mon, 6 Jun 2011 21:24:59 +0100

I don't think what happens in contrived simulations hits the main
methodological issue at all. As a geographer, some of the time, an
outlier to me is something like the Amazon: big, different,
and something that needs to be accommodated in the model. That can be
done in many ways other than by discarding outliers. Once throwing
away awkward data is regarded as legitimate, when do you stop?
(Independent evidence that an outlier is untrustworthy, as in lab
records of experiments, is a different matter, although even there,
there are well-known stories of data discarded out of prior
prejudice.)

To make the question as stark as possible, and to suppress large areas
of grey (gray): there are people who fit the data to the model and
people who fit models to the data. That may sound like the same thing,
but being so confident that the model is right that you are happy to
discard the most inconvenient data is not at all the same as letting
the data tell you something about the inadequacies of the current
model.

Nick

On Mon, Jun 6, 2011 at 5:57 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Nick et al. --
> Here is a simulation example demonstrating my claim.
>
> clear all
> prog simedetect, rclass
>  syntax [, n(int 1000) p(int 50) y me(int 1)]
>  drawnorm x1 x2 x3 e u1 u2 u3, n(`n') clear
>  g p=uniform()
>  g y=x1+x2+x3+e
>  replace x1=x1+`me'*u1 if p<(`p'/100)
>  replace x2=x2+`me'*u2 if p<(`p'/100)
>  replace x3=x3+`me'*u3 if p<(`p'/100)
>  reg y x1 x2 x3, r
>  foreach v in x1 x2 x3 {
>  ret scalar `v'=_b[`v']
>  ret scalar s`v'=_se[`v']
>  _pctile `v', nq(100)
>  g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
>  }
>  foreach v of loc y {
>  _pctile `v', nq(100)
>  g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
>  }
>  egen ux=rowtotal(lo_*)
>  reg y x1 x2 x3 if ux<1, r
>  foreach v in x1 x2 x3 {
>  ret scalar u`v'=_b[`v']
>  ret scalar su`v'=_se[`v']
>  qui su `v'
>  g t_`v'=((`v'-r(mean))/r(sd))^2
>  }
>  foreach v of loc y {
>  qui su `v'
>  g t_`v'=((`v'-r(mean))/r(sd))^2
>  }
>  egen d=rowtotal(t_*)
>  _pctile d, nq(100)
>  g byte mx=(d>r(r96))
>  reg y x1 x2 x3 if mx<1, r
>  foreach v in x1 x2 x3 {
>  ret scalar m`v'=_b[`v']
>  ret scalar sm`v'=_se[`v']
>  }
>  eret clear
>  end
>
> * draw 1000 datasets and try 2 trimming rules to drop extreme X
> simul,r(1000) seed(1) nodots:simedetect
> su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
> su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
> loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
> foreach v in x1 x2 x3 {
>  tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(t`v')
>  g mse_`v'=(`v'-1)^2
>  g mse_u`v'=(u`v'-1)^2
>  g mse_m`v'=(m`v'-1)^2
>  }
> su mse*, sep(3)
> * univariate trim dominates
>
> * now try trimming based on outcome variable too
> simul,r(1000) seed(1) nodots:simedetect, y
> su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
> su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
> loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
> foreach v in x1 x2 x3 {
>  tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(y`v')
>  g mse_`v'=(`v'-1)^2
>  g mse_u`v'=(u`v'-1)^2
>  g mse_m`v'=(m`v'-1)^2
>  }
> su mse*, sep(3)
> * now no trimming is clearly better; the trimming introduces bias
>
> * last try trimming based on outcome variable too w/o meas error
> simul,r(1000) seed(1) nodots:simedetect, y me(0)
> su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
> su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
> loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
> foreach v in x1 x2 x3 {
>  tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(z`v')
>  g mse_`v'=(`v'-1)^2
>  g mse_u`v'=(u`v'-1)^2
>  g mse_m`v'=(m`v'-1)^2
>  }
> su mse*, sep(3)
> * no trimming is clearly better; the trimming introduces bias
>
>
>> On Mon, Jun 6, 2011 at 10:45 AM, Austin Nichols <austinnichols@gmail.com> wrote:
>>> Nick--
>>> I think the advisability of trimming outliers depends on what is
>>> meant. Restricting a regression to a range of X (the explanatory
>>> variables) more plausibly free of measurement error, by dropping
>>> cases with extreme values, can improve estimates, both by reducing
>>> bias due to measurement error and by giving much more accurate SEs.
>>> But doing the same to the outcome y will typically introduce bias
>>> even where there was none before; in general, selecting on the
>>> outcome variable is demonstrably a terrible idea.
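[Editor's note: the two claims quoted above can also be checked outside
Stata. The sketch below is my own illustration, not code from the
thread; the sample size, 50% contamination rate, and 2%/98% trim
percentiles are arbitrary choices made to mirror the simulation's setup.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x."""
    xc = x - x.mean()
    return xc @ (y - y.mean()) / (xc @ xc)

x = rng.standard_normal(n)
y = x + rng.standard_normal(n)              # true slope is 1

# (1) Trim the top/bottom 2% of the *outcome*: even with no
# measurement error anywhere, the slope is biased toward zero.
lo, hi = np.percentile(y, [2, 98])
keep = (y > lo) & (y < hi)
b_full = ols_slope(x, y)                    # ~1.00
b_trim_y = ols_slope(x[keep], y[keep])      # ~0.89, attenuated

# (2) Contaminate half the observations' x with classical
# measurement error; trimming extreme *regressor* values drops
# mostly contaminated cases and so reduces attenuation bias.
bad = rng.random(n) < 0.5
x_obs = x + rng.standard_normal(n) * bad
lo, hi = np.percentile(x_obs, [2, 98])
keep = (x_obs > lo) & (x_obs < hi)
b_noisy = ols_slope(x_obs, y)               # ~0.67, attenuated
b_trim_x = ols_slope(x_obs[keep], y[keep])  # ~0.71, closer to 1

print(b_full, b_trim_y, b_noisy, b_trim_x)
```

Trimming on y shrinks the slope even in a clean design, while trimming
on a partly mismeasured x moves the slope back toward the truth,
because extreme observed x values are disproportionately contaminated.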

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

