Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Removing outliers from my dataset
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: Removing outliers from my dataset
Date: Tue, 16 Apr 2013 15:30:41 +0100

The bottom line here is that you have to make your own decisions; many
of us feel free to give advice meanwhile.

The story here seems to differ from your previous story, in which you
seemed concerned about several values that were genuine but extreme;
now you are implying that such values are bogus as well as extreme.
What made you change your mind?

The idea that only means make theoretical sense evokes only a rude
retort from me. Even if use of means is standard practice in your
field I'd expect worthwhile research to challenge dogma and be
innovative. Yours could be a thesis that underlined that means are
unreliable in the presence of outliers, although that's a statistical
commonplace too.

If you don't believe me, you should believe an economist. Here is John
Maynard Keynes, perhaps better known for other work, explaining that
different kinds of means make sense depending on what is being done:

Keynes, J.M. 1911. The Principal Averages and the Laws of Error Which
Lead to Them. Journal of the Royal Statistical Society 74:322–31.

I tend to draw the line at giving people Stata code for things I think
are a bad idea, as I've said already. That said, I recently posted
-trimmean- and -trimplot- on SSC as an implementation of trimmed
means. Those don't involve dropping outliers from the dataset, just
leaving them out when computing averages. Both programs in my files are
_much_ enhanced now over those on SSC, but I am not ready for a second
release.
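
As a rough illustration of what a trimmed mean does, here is a minimal
sketch in Python. It is not the -trimmean- code from SSC and makes no
claim to match its options or output. The function name trimmed_mean,
the 10% trimming fraction and the ROA figures are all assumptions made
up for the example; only the 6500% value echoes the thread. The values
are sorted, a fixed fraction in each tail is set aside, and the rest
are averaged. The data themselves are left untouched.

```python
import math
from statistics import mean, median


def trimmed_mean(values, fraction):
    """Mean after ignoring a fraction of the values in each tail.

    `fraction` is the proportion set aside at EACH end (0 <= fraction < 0.5).
    The input is not modified: nothing is dropped from the data themselves.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = math.floor(fraction * len(ordered))  # number ignored in each tail
    kept = ordered[k:len(ordered) - k]
    return mean(kept)


# Hypothetical ROA values in percent (invented for illustration).
roa = [-12.0, 2.0, 3.5, 4.0, 5.0, 5.5, 6.0, 7.0, 9.0, 6500.0]

print("mean:            ", mean(roa))
print("median:          ", median(roa))
print("10% trimmed mean:", trimmed_mean(roa, 0.10))
print("observations still in the data:", len(roa))
```

With these made-up numbers the ordinary mean is 653, while the median
and the 10% trimmed mean are both 5.25, and all ten observations remain
available for any other analysis.
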

Your timetable problems don't affect my advice.

Nick
[email protected]
On 16 April 2013 15:11, Andreas Dall Frøseth
<[email protected]> wrote:
> Thank you for your feedback. I agree with the concerns you listed, and will certainly keep them in mind while analyzing possible problems with my dataset.
>
> However, because I am using accounting numbers as an approximation of the real economic nature of a company, I will run into some difficulties. Accounting numbers are strongly affected by accounting principles, which might give certain companies an incentive to portray a false economic situation to the public. With this in mind, I have reviewed a large number of the companies whose ROA, in my opinion, has no root in economic sense, and found that this can only be due to faults in the accounting numbers. My aim is to find what constitutes the best basis for competitive advantage in the companies in my dataset, defining competitive advantage as earning a ROA above the industry mean. Using the industry median rather than the industry mean would depart from that definition, and my thesis would make no economic or theoretical sense. Calculating an industry mean from figures distorted by accounting principles, on the other hand, might lead to the wrong conclusion.
>
> This being the case, I would like a systematic approach to excluding extreme numbers from my dataset, rather than excluding them manually. I'm aware that this approach might exclude more than the companies I am concerned about, but given the limited hours available and a rather large dataset, I find it irrational to do this any other way.
>
> Is there an approach that would let me analyze the effect of excluding some of the companies before I drop them from my set (i.e. creating a variable and using the if-command)? This would be helpful when documenting my work in my thesis.
>
> Kind regards
> Andreas
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
> Sent: 16 April 2013 13:21
> To: [email protected]
> Subject: Re: st: Removing outliers from my dataset
>
> My input is that in my field dropping outliers like this would be
> regarded as very poor practice. Dropping outliers that appear to be
> unrecoverable measurement errors could be justified; dropping outliers
> that are genuine extremes makes little sense. On these grounds you
> would discard the Amazon because on most criteria it's the world's
> biggest river (and not only that, really big).
>
> Even if your aim is to focus on what's typical you shouldn't just
> throw away 2% of the data because those fractions are awkward to
> analyse. That's a reflection on the competence of the analyst, not a
> concession to the world's complexities.
>
> Genuine alternatives include
>
> using something other than means to summarize (e.g. median, or something else)
>
> transforming your problematic variables
>
> non-identity link functions in -glm-
>
> robust or quantile regression
>
> A specific objection to outlier criteria based on mean and SD is that
> outliers tend to inflate mean and especially SD, so you may not even
> drop what you think is obviously deviant. Using median and median
> deviation is not quite so vulnerable, but can be problematic in
> different ways.
>
> It's not difficult to work out how to relate values to mean and SD,
> but I am all set on discouraging you from doing any such thing.
>
> Nick
> [email protected]
>
>
> On 16 April 2013 12:07, Andreas Dall Frøseth
> <[email protected]> wrote:
>> Dear statalisters.
>>
>> I'm still working with my dataset containing company data and accounting numbers for a large number of companies in the period 1992 to 2010. I've figured out the calculations in my last posting, but I'm currently experiencing some difficulties when trying to analyse the effect of outliers in my dataset.
>>
>> For each company in the set, I have calculated ROA and sales growth for each year. The problem, however, is that due to weaknesses in my dataset some of the companies are now listed with unreasonable values of ROA. For example, one of the companies has a ROA of 6500%. I have assured myself that this is not due to a miscalculation, but rather to accounting rules and newly listed companies.
>> I therefore wish to examine my dataset and, if possible, get rid of outliers. My main idea is to calculate a mean industry ROA (the set contains a variable "industry", with an industry indicator for each company, e.g. 4 = fishing) and to set a bottom and top limit for each industry. Does anyone have input on the most appropriate way to do this? There seem to be several different approaches I could apply, e.g. 2 standard deviations from the mean or the 1st and 99th percentiles.
>>
>> I would also appreciate input on how to do this in Stata: creating an industry mean, and a variable indicating which companies to exclude from my population based on a recommended approach.
>>
>> All feedback will be appreciated.
>>
>> Kind regards
>> Andreas
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/