



Re: st: Elimination of outliers


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Elimination of outliers
Date   Mon, 6 Jun 2011 12:57:17 -0400

Nick et al. --
Here is a simulation example demonstrating my claim.

clear all
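prog simedetect, rclass
 * n = obs per dataset; p = percent of cases with measurement error;
 * y = also trim on the outcome; me = scale of the measurement error
 syntax [, n(int 1000) p(int 50) y me(int 1)]
 drawnorm x1 x2 x3 e u1 u2 u3, n(`n') clear
 g p=uniform()
 * true coefficients are all 1
 g y=x1+x2+x3+e
 * add measurement error to each x for a random share of cases
 replace x1=x1+`me'*u1 if p<(`p'/100)
 replace x2=x2+`me'*u2 if p<(`p'/100)
 replace x3=x3+`me'*u3 if p<(`p'/100)
 * (1) OLS on the full sample
 reg y x1 x2 x3, r
 foreach v in x1 x2 x3 {
  ret scalar `v'=_b[`v']
  ret scalar s`v'=_se[`v']
  * flag values outside the 2nd to 98th percentile range
  _pctile `v', nq(100)
  g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
  }
 foreach v of loc y {
  _pctile `v', nq(100)
  g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
  }
 * (2) univariate trim: drop cases flagged on any variable
 egen ux=rowtotal(lo_*)
 reg y x1 x2 x3 if ux<1, r
 foreach v in x1 x2 x3 {
  ret scalar u`v'=_b[`v']
  ret scalar su`v'=_se[`v']
  qui su `v'
  g t_`v'=((`v'-r(mean))/r(sd))^2
  }
 foreach v of loc y {
  qui su `v'
  * keep default float storage; byte would truncate the squared z-score
  g t_`v'=((`v'-r(mean))/r(sd))^2
  }
 * (3) multivariate trim: drop the 4% with the largest sum of squared z-scores
 egen d=rowtotal(t_*)
 _pctile d, nq(100)
 g byte mx=(d>r(r96))
 reg y x1 x2 x3 if mx<1, r
 foreach v in x1 x2 x3 {
  ret scalar m`v'=_b[`v']
  ret scalar sm`v'=_se[`v']
  }
 eret clear
 end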
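
Before running the full simulation, a single call can confirm that the program runs and returns the scalars that simulate will collect. This is only a sketch: it assumes simedetect has been defined exactly as above, and the seed value is arbitrary.

```stata
* one replication with the default settings (n=1000, p=50, me=1)
set seed 1
simedetect
* list the returned scalars: x1-x3 (OLS), ux1-ux3 (univariate trim),
* mx1-mx3 (multivariate trim), and the matching SEs prefixed with s
return list
```

Adding the y option to the call (simedetect, y) should exercise the branch that also trims on the outcome.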

* draw 1000 datasets and try 2 trimming rules to drop extreme X
simul,r(1000) seed(1) nodots:simedetect
su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
foreach v in x1 x2 x3 {
 tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(t`v')
 g mse_`v'=(`v'-1)^2
 g mse_u`v'=(u`v'-1)^2
 g mse_m`v'=(m`v'-1)^2
 }
su mse*, sep(3)
* univariate trim dominates

* now try trimming based on outcome variable too
simul,r(1000) seed(1) nodots:simedetect, y
su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
foreach v in x1 x2 x3 {
 tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(y`v')
 g mse_`v'=(`v'-1)^2
 g mse_u`v'=(u`v'-1)^2
 g mse_m`v'=(m`v'-1)^2
 }
su mse*, sep(3)
* now no trimming is clearly better; the trimming introduces bias

* last, try trimming on the outcome variable too, but without measurement error
simul,r(1000) seed(1) nodots:simedetect, y me(0)
su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
foreach v in x1 x2 x3 {
 tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(z`v')
 g mse_`v'=(`v'-1)^2
 g mse_u`v'=(u`v'-1)^2
 g mse_m`v'=(m`v'-1)^2
 }
su mse*, sep(3)
* no trimming is clearly better; the trimming introduces bias
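
As a rough cross-check on the untrimmed OLS estimates (a back-of-envelope derivation, not part of the original post), the attenuation implied by the data-generating process in simedetect can be worked out directly. The symbols are introduced here for illustration only: \pi = p/100, m = me, and D is the indicator that a case receives measurement error. It relies on the drawnorm defaults (mean 0, standard deviation 1, uncorrelated variables).

```latex
\begin{align*}
x_j^{*} &= x_j + m D u_j, \qquad D \sim \mathrm{Bernoulli}(\pi), \quad j = 1,2,3 \\
\operatorname{Var}(x_j^{*}) &= 1 + m^{2}\pi, \qquad
\operatorname{Cov}(x_j^{*}, y) = \operatorname{Cov}(x_j, y) = 1, \qquad
\operatorname{Cov}(x_j^{*}, x_k^{*}) = 0 \ (j \neq k) \\
\operatorname{plim}\, \hat{b}_j &= \frac{\operatorname{Cov}(x_j^{*}, y)}{\operatorname{Var}(x_j^{*})} = \frac{1}{1 + m^{2}\pi}
\end{align*}
```

With the defaults (\pi = 0.5, m = 1) this gives 2/3, so the untrimmed coefficients should centre near 0.67 rather than 1; with me(0) there is no attenuation and the limit is 1. The derivation says nothing about the trimmed estimators, which is what the simulation is for.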


> On Mon, Jun 6, 2011 at 10:45 AM, Austin Nichols <austinnichols@gmail.com> wrote:
>> Nick--
>> I think the advisability of trimming outliers depends on what is
>> meant; restricting a regression to a range of X (explanatory
>> variables) more plausibly free of measurement error by dropping cases
>> with extreme values can improve estimates, both by reducing bias due
>> to measurement error and providing much more accurate SEs; but doing
>> the same to the outcome y will typically introduce bias even where
>> there was none before--in general selecting on the outcome variable is
>> demonstrably a terrible idea.