
Re: st: Elimination of outliers


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Elimination of outliers
Date   Mon, 6 Jun 2011 21:24:59 +0100

I don't think what happens in contrived simulations hits the main
methodological issue at all. As a geographer, some of the time, an
outlier to me is something like the Amazon: big, different,
and something that needs to be accommodated in the model. That can be
done in many ways other than by discarding outliers. Once throwing
away awkward data is regarded as legitimate, when do you stop?
(Independent evidence that an outlier is untrustworthy, as in lab
records of experiments, is a different matter, although even there,
there are well-known stories of data discarded out of prior
prejudice.)

To make the question as stark as possible, and to suppress large areas
of grey (gray): there are people who fit the data to the model and
people who fit models to the data. That may sound like the same thing,
but being so confident that the model is right that you are happy to
discard the most inconvenient data is not at all the same as letting
the data tell you something about the inadequacies of the current
model.

Nick

On Mon, Jun 6, 2011 at 5:57 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Nick et al. --
> Here is a simulation example demonstrating my claim.
>
> clear all
> prog simedetect, rclass
>  syntax [, n(int 1000) p(int 50) y me(int 1)]
>  drawnorm x1 x2 x3 e u1 u2 u3, n(`n') clear
>  g p=uniform()
>  g y=x1+x2+x3+e
>  replace x1=x1+`me'*u1 if p<(`p'/100)
>  replace x2=x2+`me'*u2 if p<(`p'/100)
>  replace x3=x3+`me'*u3 if p<(`p'/100)
>  reg y x1 x2 x3, r
>  foreach v in x1 x2 x3 {
>  ret scalar `v'=_b[`v']
>  ret scalar s`v'=_se[`v']
>  _pctile `v', nq(100)
>  g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
>  }
>  foreach v of loc y {
>  _pctile `v', nq(100)
>  g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
>  }
>  egen ux=rowtotal(lo_*)
>  reg y x1 x2 x3 if ux<1, r
>  foreach v in x1 x2 x3 {
>  ret scalar u`v'=_b[`v']
>  ret scalar su`v'=_se[`v']
>  qui su `v'
>  g t_`v'=((`v'-r(mean))/r(sd))^2
>  }
>  foreach v of loc y {
>  qui su `v'
>  g t_`v'=((`v'-r(mean))/r(sd))^2
>  }
>  egen d=rowtotal(t_*)
>  _pctile d, nq(100)
>  g byte mx=(d>r(r96))
>  reg y x1 x2 x3 if mx<1, r
>  foreach v in x1 x2 x3 {
>  ret scalar m`v'=_b[`v']
>  ret scalar sm`v'=_se[`v']
>  }
>  eret clear
>  end
>
> * draw 1000 datasets and try 2 trimming rules to drop extreme X
> simul,r(1000) seed(1) nodots:simedetect
> su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
> su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
> loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
> foreach v in x1 x2 x3 {
>  tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(t`v')
>  g mse_`v'=(`v'-1)^2
>  g mse_u`v'=(u`v'-1)^2
>  g mse_m`v'=(m`v'-1)^2
>  }
> su mse*, sep(3)
> * univariate trim dominates
>
> * now try trimming based on outcome variable too
> simul,r(1000) seed(1) nodots:simedetect, y
> su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
> su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
> loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
> foreach v in x1 x2 x3 {
>  tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(y`v')
>  g mse_`v'=(`v'-1)^2
>  g mse_u`v'=(u`v'-1)^2
>  g mse_m`v'=(m`v'-1)^2
>  }
> su mse*, sep(3)
> * now no trimming is clearly better; the trimming introduces bias
>
> * last try trimming based on outcome variable too w/o meas error
> simul,r(1000) seed(1) nodots:simedetect, y me(0)
> su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
> su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
> loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
> foreach v in x1 x2 x3 {
>  tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(z`v')
>  g mse_`v'=(`v'-1)^2
>  g mse_u`v'=(u`v'-1)^2
>  g mse_m`v'=(m`v'-1)^2
>  }
> su mse*, sep(3)
> * no trimming is clearly better; the trimming introduces bias
>
>
>> On Mon, Jun 6, 2011 at 10:45 AM, Austin Nichols <austinnichols@gmail.com> wrote:
>>> Nick--
>>> I think the advisability of trimming outliers depends on what is
>>> meant. Restricting a regression to a range of X (the explanatory
>>> variables) more plausibly free of measurement error, by dropping
>>> cases with extreme values, can improve estimates, both by reducing
>>> bias due to measurement error and by giving much more accurate SEs.
>>> But doing the same to the outcome y will typically introduce bias
>>> even where there was none before; in general, selecting on the
>>> outcome variable is demonstrably a terrible idea.
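[Editor's note: the two claims quoted above can also be checked outside
Stata. The sketch below is my own illustration, not code from the
thread; the sample size, 50% contamination rate, and 2%/98% trim
percentiles are arbitrary choices made to mirror the simulation's setup.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x."""
    xc = x - x.mean()
    return xc @ (y - y.mean()) / (xc @ xc)

x = rng.standard_normal(n)
y = x + rng.standard_normal(n)              # true slope is 1

# (1) Trim the top/bottom 2% of the *outcome*: even with no
# measurement error anywhere, the slope is biased toward zero.
lo, hi = np.percentile(y, [2, 98])
keep = (y > lo) & (y < hi)
b_full = ols_slope(x, y)                    # ~1.00
b_trim_y = ols_slope(x[keep], y[keep])      # ~0.89, attenuated

# (2) Contaminate half the observations' x with classical
# measurement error; trimming extreme *regressor* values drops
# mostly contaminated cases and so reduces attenuation bias.
bad = rng.random(n) < 0.5
x_obs = x + rng.standard_normal(n) * bad
lo, hi = np.percentile(x_obs, [2, 98])
keep = (x_obs > lo) & (x_obs < hi)
b_noisy = ols_slope(x_obs, y)               # ~0.67, attenuated
b_trim_x = ols_slope(x_obs[keep], y[keep])  # ~0.71, closer to 1

print(b_full, b_trim_y, b_noisy, b_trim_x)
```

Trimming on y shrinks the slope even in a clean design, while trimming
on a partly mismeasured x moves the slope back toward the truth,
because extreme observed x values are disproportionately contaminated.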

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

