Statalist The Stata Listserver



st: RE: RE: Re: RE: Re: RE: RE: IQR


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: RE: Re: RE: Re: RE: RE: IQR
Date   Thu, 7 Jun 2007 20:25:44 +0100

Sure, there is a -winsor- program on SSC, which I wrote, 
and, according to Kit Baum's reports, it is quite heavily 
used. I have never used it myself, except in development. 

I cannot recall the details, but perhaps someone 
wrote in to Statalist reporting that Stata did not 
seem to support Winsorizing and that this was a black 
mark against Stata. The best reply to that was a 
program, as concrete evidence that you can easily do 
Winsorizing in Stata: here is one way to do it. 
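
For readers who have not met the term, here is a minimal sketch of
what Winsorizing does. It is written in Python purely for
illustration; it is not the -winsor- code and makes no claim to match
its options or defaults. The function name winsorize and the argument
h (the number of values pulled in from each tail) are assumptions of
this sketch.

```python
def winsorize(values, h):
    """Return a copy of values in which the h lowest values are replaced
    by the next lowest value and the h highest by the next highest."""
    n = len(values)
    if h < 0 or 2 * h >= n:
        raise ValueError("h must satisfy 0 <= 2*h < len(values)")
    ordered = sorted(values)
    low = ordered[h]            # smallest value left untouched
    high = ordered[n - h - 1]   # largest value left untouched
    # Clamp every value into [low, high], keeping the original order.
    return [min(max(v, low), high) for v in values]


# Hypothetical data with one wild value.
data = [2, 3, 5, 4, 3, 97, 4, 1]
print(winsorize(data, 1))
```

With h = 1 the value 97 is pulled in to 5 and the value 1 is pulled
in to 2. Nothing is dropped, which is the main contrast with trimming.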

But let us look at the wider picture. There is no 
one way to deal with outliers. There are many ways 
to deal with outliers, including 

1. Going out "into the field" and doing the measurement 
again. 

2. Testing whether they are genuine. Most of the
tests look pretty contrived to me, but you might find one
that you can believe fits your situation. Irrational 
faith that a test is appropriate is always needed
to apply a test that is then presented as quintessentially
rational. 

3. Throwing them out as a matter of judgement, i.e. 
in Stata terms -drop-ping them from the data. 

4. Throwing them out using some more-or-less 
automated (usually not "objective") rule.  

5. Ignoring them, along the lines of either 3 or 4. 
This could be formal (e.g. trimming) or just leaving 
them in the dataset, but omitting them from analyses
as too hot to handle. 

6. Pulling them in using some kind of adjustment, 
e.g. Winsorizing. 

7. Downplaying them by using some other robust estimation
method. 

8. Downplaying them by working on a transformed 
scale. 

9. Downplaying them by using a non-identity link 
function. 

10. Accommodating them by fitting some appropriate
fat-, long-, or heavy-tailed distribution, without
or with predictors. 

11. Sidestepping the issue by using some non-parametric
(e.g. rank-based) procedure. 

12. Getting a handle on the implied uncertainty 
using a bootstrap, jackknife or permutation-based
procedure. 

13. Editing to replace an outlier with some more
likely value, based on deterministic logic. "An 18-
year-old grandmother is unlikely, but the person 
in question was born in 1926, so presumably is
really 81." 

14. Editing to replace an impossible or implausible 
outlier using some imputation method that is currently
acceptable not-quite-white magic. 

15. Analysing with and without, and seeing how much 
difference the outlier(s) make(s), statistically, 
scientifically or practically. 

16. Something Bayesian. My prior ignorance of quite
what forbids me from giving any details. 

Naturally, these categories intergrade in some 
cases, and I can believe I have forgotten
or am not aware of yet other approaches. 
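
Approaches 4, 7 and 15 can be sketched together. The example below is
in Python, for illustration only, and rests on stated assumptions: the
automated rule is the common boxplot convention of flagging values
more than 1.5 IQR beyond the quartiles, the quartiles come from
statistics.quantiles with method="inclusive" (other quartile
definitions will give slightly different fences), and the names
iqr_flags and with_and_without are hypothetical.

```python
import statistics


def iqr_flags(values, k=1.5):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v < lower or v > upper for v in values]


def with_and_without(values, k=1.5):
    """Summarize the data with and without the flagged values."""
    flags = iqr_flags(values, k)
    kept = [v for v, flagged in zip(values, flags) if not flagged]
    return {
        "n_flagged": sum(flags),
        "mean_all": statistics.mean(values),
        "mean_kept": statistics.mean(kept),
        "median_all": statistics.median(values),
        "median_kept": statistics.median(kept),
    }


# Hypothetical data with one wild value.
data = [2, 3, 5, 4, 3, 97, 4, 1]
for name, value in with_and_without(data).items():
    print(name, value)
```

On these made-up data the mean moves a great deal when the flagged
value is left out and the median hardly at all, which is the kind of
comparison approach 15 has in mind. The rule is automated, but the
choice of 1.5 is still a judgement.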

What is quite striking to me -- as with many 
areas of statistical science -- is how much 
preferred solutions vary between investigators
and disciplines, despite the broad similarity
of the problems that outliers pose. 

Nick 
n.j.cox@durham.ac.uk 

Rajesh Tharyan
 
> Isn't there a winsor ado (written by Nick) which can be used 
> to deal with outliers? In some cases it may be preferable to 
> throwing out the observations?

Rodrigo A. Alfaro
 
> It seems to be a 'common' practice when COMPUSTAT
> data is used. The dataset is composed of the balance sheet
> reports of US firms. It would be difficult to identify in the 
> data mergers, splits or any sort of change in property that 
> implies a huge change in the composition of a firm (in terms 
> of assets, fixed capital, etc.), so dropping extreme values 
> of change in assets allows you to 'delete' the unexplained 
> firms. Also, a similar problem affects the price, where 
> sometimes a change in the dividend policy can produce a 
> jump that makes sense only when the researcher knows about 
> the change in policy. Usually, researchers do not know 
> about these policies, or it is a titanic (and maybe useless) 
> job trying to include them in the analysis.
