



Re: st: How to detect outliers


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: How to detect outliers
Date   Mon, 11 Feb 2013 19:24:40 +0000

I wouldn't regard any kind of large residual as indicating outliers
unequivocally. On the contrary, a really marked outlier is likely to
pull the regression towards it, with the result of a small residual.
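A small made-up example may help show the pull. The data below are invented for illustration only, and the names x, y, resid and lev are not from the original question. One point is moved far away from the rest, and its raw residual and leverage are listed next to those of an ordinary point.

```stata
* Toy data, invented for illustration only (not from the thread).
clear
set obs 20
generate x = _n
generate y = x
* Move the last point far from the rest in both x and y.
replace x = 100 in 20
replace y = 0 in 20

regress y x
predict double resid, residuals
predict double lev, leverage

* Compare the distant point (row 20) with an ordinary point (row 19).
list x y resid lev in 19/20
```

In this toy case the distant point has very high leverage, and its raw residual is smaller in magnitude than that of the ordinary point beside it, because the fitted line has been dragged towards it.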

Your criterion here for Cook's distance is 4/n, but evidently you are
fitting regressions separately for each period. The total dataset size
of 165779 is not pertinent to regressions fitted individually. The
relevant n is the number of observations used in each regression.
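A minimal sketch of a per-regression cutoff follows, assuming the variables Y, X1-X4 and Period from the quoted code are in memory. The names distance and cutoff are just illustrative. After -regress-, e(N) holds the number of observations used in that fit. Note also that Stata treats missing as larger than any number, so comparisons with > should exclude missings explicitly.

```stata
* Sketch only: assumes Y, X1-X4 and Period as in the quoted code.
* distance and cutoff are illustrative names, not from the original post.
generate double distance = .
generate double cutoff = .
forvalues z = 2/120 {
    capture regress Y X1 X2 X3 X4 if Period == `z', noconstant
    if !_rc {
        tempvar d
        predict double `d' if e(sample), cooksd
        replace distance = `d' if e(sample)
        * e(N) is the number of observations used in this regression.
        replace cutoff = 4 / e(N) if e(sample)
        drop `d'
    }
}
* Missing values count as larger than any number, so exclude them.
count if distance > cutoff & !missing(distance, cutoff)
```

Whether 4/n is a sensible rule at all is a separate question; the sketch only shows how to make n match each regression.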

I think you'd learn more from residual vs fitted plots, even all 119 of them.
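One way to produce those plots is sketched below, again assuming the variables from the quoted code. -rvfplot- is the official residual-versus-fitted plot after -regress-. The graph names rvf2, rvf3, ... are made up here.

```stata
* Sketch only: assumes Y, X1-X4 and Period as in the quoted code.
forvalues z = 2/120 {
    capture regress Y X1 X2 X3 X4 if Period == `z', noconstant
    if !_rc {
        * One residual-versus-fitted plot per period, kept in memory by name.
        rvfplot, yline(0) title("Period `z'") name(rvf`z', replace)
    }
}
```

Holding 119 graphs in memory may be unwieldy. Exporting each one with -graph export- inside the loop would be an alternative.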

Whether you would be better off with a different model depends on your
research problem.

Nick

On Mon, Feb 11, 2013 at 6:50 PM, Xixi Lin <winnielxx@gmail.com> wrote:
> Hi,
> I tried two ways to detect outliers: one is to flag observations with
> Cook's Distance greater than 4/n; the other is to flag those with
> standardized residuals greater than 2 in magnitude. Here is my code:
>
> gen residual=.
> tempvar temp
>    foreach z of numlist 2/120 {
>       capture reg Y X1 X2 X3 X4 if Period==`z', noconstant
>       if !_rc {
>         predict temp,rstu
>         replace residual=temp if Period==`z'
>         drop temp
>       }
>    }
>
> //cook's distance
> gen di_bench=4/165979
> gen distance=.
> tempvar temp1
> foreach z of numlist 2/120 {
>       capture reg Y X1 X2 X3 X4 if Period==`z', noconstant
>       if !_rc {
>         predict temp1,cook
>         replace distance=temp1 if Period==`z'
>         drop temp1
>       }
>    }
> //outlier numbers
> count if abs(residual) > 2    // 7922
> count if distance > di_bench     //111879
>
> My question is: did I mess up the code? Why are the two results so
> different? One shows 7922 outliers, the other shows 111879 outliers.
> If I compare Cook's Distance with 1, then the number of outliers is 133.
>
> Can anyone tell me which method I should choose? Or are there any
> better ways to detect outliers? Thanks a lot.
>
> Best,
> Xixi Lin
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/


