
# Re: st: How to detect outliers

From: Xixi Lin
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: How to detect outliers
Date: Mon, 11 Feb 2013 14:37:56 -0500

```
Hi Nick,

You are absolutely right! I messed up the number of observations; it
should be the number of observations in each period instead. After I
fixed that, the results from these two methods are pretty close.

Thanks again. You are so helpful! ^_^

Best,
Xixi Lin

On Mon, Feb 11, 2013 at 2:24 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> I wouldn't regard any kind of large residual as indicating outliers
> unequivocally. On the contrary, a really marked outlier is likely to
> pull the regression towards it, with the result of a small residual.
>
> Your criterion here for Cook is 4/n, but evidently you are fitting
> regressions separately for each period. The total dataset size of
> 165779 is not pertinent to regressions fitted individually. The
> relevant criterion is the number of observations used in each
> regression.
>
> I think you'd learn more from residual vs fitted plots, even all 119 of them.
>
> Whether you would be better off with a different model depends on your
> research problem.
>
> Nick
>
> On Mon, Feb 11, 2013 at 6:50 PM, Xixi Lin <winnielxx@gmail.com> wrote:
>> Hi,
>> I tried two ways to detect outliers: one is to regard observations
>> with Cook’s Distance greater than 4/n as outliers; the other is to
>> regard those with standardized residuals greater than 2 in magnitude
>> as outliers. Here is my code:
>>
>> gen residual=.
>> tempvar temp
>>    foreach z of numlist 2/120 {
>>       capture reg Y X1 X2 X3 X4 if Period==`z', noconstant
>>       if !_rc {
>>         predict temp,rstu
>>         replace residual=temp if Period==`z'
>>         drop temp
>>       }
>>    }
>>
>> //cook's distance
>> gen di_bench=4/165979
>> gen distance=.
>> tempvar temp1
>> foreach z of numlist 2/120 {
>>       capture reg Y X1 X2 X3 X4 if Period==`z', noconstant
>>       if !_rc {
>>         predict temp1,cook
>>         replace distance=temp1 if Period==`z'
>>         drop temp1
>>       }
>>    }
>> //outlier numbers
>> count if abs(residual) > 2    // 7922
>> count if distance > di_bench     //111879
>>
>> My question is: did I mess up the code? Why are the two results so
>> different? One shows 7922 outliers, the other shows 111879 outliers.
>> If I compare Cook's Distance with 1, then the number of outliers is 133.
>>
>> Can anyone tell me which method I should choose? Or are there any
>> other better ways to detect outliers? Thanks a lot.
>>
>> Best,
>> Xixi Lin
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
```
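
Nick's point about the Cook's distance criterion (use the number of observations in each regression, not the size of the whole dataset) might be sketched in Stata roughly as follows. This is only a sketch under assumptions: it reuses the variable names from the quoted code (`Y`, `X1`-`X4`, `Period`), treats 4/n as a rule of thumb with n taken from `e(N)` after each regression, and introduces the hypothetical variable names `cooksd_t` and `cutoff_t` for illustration. It is not necessarily what the original poster ended up running.

```stata
* Sketch only: Y, X1-X4 and Period come from the quoted code;
* cooksd_t and cutoff_t are hypothetical names introduced here.
gen double cooksd_t = .
gen double cutoff_t = .

forvalues z = 2/120 {
    capture regress Y X1 X2 X3 X4 if Period == `z', noconstant
    if !_rc {
        tempvar d
        predict double `d' if e(sample), cooksd
        replace cooksd_t = `d' if e(sample)
        * e(N) is the number of observations used in this regression
        replace cutoff_t = 4 / e(N) if e(sample)
        drop `d'
    }
}

* Missing values compare as larger than any number in Stata,
* so exclude them explicitly when counting
count if cooksd_t > cutoff_t & !missing(cooksd_t)
```

Storing the cutoff per observation keeps each period's comparison tied to its own sample size. The explicit `!missing()` condition matters because a comparison such as `distance > di_bench` is also true for observations where the distance is missing, which can inflate the count.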