Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: Elimination of outliers

From: Austin Nichols
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Elimination of outliers
Date: Mon, 6 Jun 2011 12:57:17 -0400

```
Nick et al. --
Here is a simulation example demonstrating my claim.

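clear all
* simedetect: draw one dataset with measurement error in X, then compare
* OLS on the full sample with two trimming rules
*   n()  sample size
*   p()  percent of cases whose x1-x3 are measured with error
*   me() scale of the measurement error (0 = none)
*   y    if specified, trim on the outcome y as well
prog simedetect, rclass
  syntax [, n(int 1000) p(int 50) y me(int 1)]
  drawnorm x1 x2 x3 e u1 u2 u3, n(`n') clear
  g p=uniform()
  * true model: all three coefficients equal 1
  g y=x1+x2+x3+e
  * contaminate x1-x3 for the same p percent of cases
  replace x1=x1+`me'*u1 if p<(`p'/100)
  replace x2=x2+`me'*u2 if p<(`p'/100)
  replace x3=x3+`me'*u3 if p<(`p'/100)
  * (1) OLS on the full sample
  reg y x1 x2 x3, r
  foreach v in x1 x2 x3 {
    ret scalar `v'=_b[`v']
    ret scalar s`v'=_se[`v']
    * flag values below the 2nd or above the 98th percentile
    _pctile `v', nq(100)
    g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
  }
  * local y holds "y" only when option y is given; otherwise this loop is skipped
  foreach v of loc y {
    _pctile `v', nq(100)
    g byte lo_`v'=(`v'<r(r2)|`v'>r(r98))
  }
  * (2) univariate trim: drop cases flagged on any variable
  egen ux=rowtotal(lo_*)
  reg y x1 x2 x3 if ux<1, r
  foreach v in x1 x2 x3 {
    ret scalar u`v'=_b[`v']
    ret scalar su`v'=_se[`v']
    * squared z-score, one term of the distance d below
    qui su `v'
    g t_`v'=((`v'-r(mean))/r(sd))^2
  }
  foreach v of loc y {
    qui su `v'
    * stored as float rather than byte, so the squared z-score is not truncated
    g t_`v'=((`v'-r(mean))/r(sd))^2
  }
  * (3) multivariate trim: drop the top 4% of the summed squared z-scores
  egen d=rowtotal(t_*)
  _pctile d, nq(100)
  g byte mx=(d>r(r96))
  reg y x1 x2 x3 if mx<1, r
  foreach v in x1 x2 x3 {
    ret scalar m`v'=_b[`v']
    ret scalar sm`v'=_se[`v']
  }
  eret clear
end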
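* Added sketch, not part of the original post: call the program once to
* confirm that it runs and to list the r() scalars that -simulate- collects
* below (x1 sx1 ux1 sux1 mx1 smx1, and likewise for x2 and x3). The option
* values shown are simply the program's defaults.
simedetect, n(1000) p(50) me(1)
return list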

* draw 1000 datasets and try 2 trimming rules to drop extreme X
simulate, reps(1000) seed(1) nodots: simedetect
su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
foreach v in x1 x2 x3 {
tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(t`v')
g mse_`v'=(`v'-1)^2
g mse_u`v'=(u`v'-1)^2
g mse_m`v'=(m`v'-1)^2
}
su mse*, sep(3)
* univariate trim dominates

* now try trimming based on outcome variable too
simulate, reps(1000) seed(1) nodots: simedetect, y
su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
foreach v in x1 x2 x3 {
tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(y`v')
g mse_`v'=(`v'-1)^2
g mse_u`v'=(u`v'-1)^2
g mse_m`v'=(m`v'-1)^2
}
su mse*, sep(3)
* now no trimming is clearly better; the trimming introduces bias

* last, try trimming based on the outcome variable too, but without measurement error
simulate, reps(1000) seed(1) nodots: simedetect, y me(0)
su x1 ux1 mx1 x2 ux2 mx2 x3 ux3 mx3, sep(3)
su sx1 sux1 smx1 sx2 sux2 smx2 sx3 sux3 smx3, sep(3)
loc o xli(1) xla(0 1) leg(lab(1 OLS) lab(2 Univ Trim) lab(3 Multivar))
foreach v in x1 x2 x3 {
tw kdensity `v'||kdensity u`v'||kdensity m`v', `o' name(z`v')
g mse_`v'=(`v'-1)^2
g mse_u`v'=(u`v'-1)^2
g mse_m`v'=(m`v'-1)^2
}
su mse*, sep(3)
* no trimming is clearly better; the trimming introduces bias

> On Mon, Jun 6, 2011 at 10:45 AM, Austin Nichols <austinnichols@gmail.com> wrote:
>> Nick--
>> I think the advisability of trimming outliers depends on what is
>> meant; restricting a regression to a range of X (explanatory
>> variables) more plausibly free of measurement error by dropping cases
>> with extreme values can improve estimates, both by reducing bias due
>> to measurement error and providing much more accurate SEs; but doing
>> the same to the outcome y will typically introduce bias even where
>> there was none before--in general selecting on the outcome variable is
>> demonstrably a terrible idea.
```
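As a rough check on the size of the measurement-error bias in the first simulation, the attenuation can be worked out from the data-generating process in the code. This is a sketch under that code's assumptions: the x's, u's, and e are independent standard normals, and one indicator D marks the contaminated cases. The symbols D, \pi = p/100, m = me, and x_j^{*} (the observed regressor) are notation introduced here for illustration and do not appear in the original post.

```latex
\begin{aligned}
x_j^{*} &= x_j + m\,D\,u_j, \qquad D \sim \mathrm{Bernoulli}(\pi),\\
\operatorname{Var}(x_j^{*}) &= 1 + \pi m^{2}, \qquad
\operatorname{Cov}(x_j^{*}, y) = \operatorname{Cov}(x_j, y) = 1, \qquad
\operatorname{Cov}(x_j^{*}, x_k^{*}) = 0 \quad (j \neq k),\\
\operatorname{plim}\hat{\beta}_j &= \frac{\operatorname{Cov}(x_j^{*}, y)}{\operatorname{Var}(x_j^{*})}
= \frac{1}{1 + \pi m^{2}}.
\end{aligned}
```

With the defaults p(50) and me(1) this gives 1/(1 + 0.5) = 2/3, so the untrimmed OLS estimates should center near 0.67 rather than the true value of 1. With me(0) the expression equals 1, which is why any bias in the last simulation has to come from the trimming itself.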