
Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement is already up and running.


SV: st: Removing outliers from my dataset

From   Andreas Dall Frøseth <>
To   "" <>
Subject   SV: st: Removing outliers from my dataset
Date   Tue, 16 Apr 2013 14:11:35 +0000

Thank you for your feedback. I agree with the concerns you listed, and will certainly keep them in mind while analyzing possible problems with my dataset.

However, because I am using accounting numbers as an approximation of the real economic nature of a company, I will run into some difficulties. Accounting numbers are strongly affected by accounting principles, which might give certain companies an incentive to portray a false economic situation to the public. With this in mind, I have reviewed a large number of the companies whose ROA, in my opinion, makes no economic sense, and found that this can only be due to faults in the accounting numbers. My aim is to find what constitutes the best basis for competitive advantage in the companies in my dataset, defining competitive advantage as earning ROA above the industry mean. Using the industry median rather than the industry mean would depart from that definition, and my thesis would make no economic or theoretical sense. Conversely, calculating an industry mean from numbers distorted by accounting principles might lead to the wrong conclusion.

This being the case, I would like a systematic approach to excluding extreme numbers from my dataset, rather than excluding them manually. I'm aware that this approach might exclude more than the companies I am concerned about, but given the limited hours available and a rather large dataset, I find it impractical to do this any other way.

Is there an approach I could use that allows me to analyze the effect of excluding some of the companies before I drop them from my set (i.e. creating a variable and using the if qualifier)? This would be helpful when documenting my work in my thesis.
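
One way to do this is to flag observations instead of dropping them, and then compare results with and without the flagged cases. The following is only a minimal sketch: the variable names roa and industry are assumptions about the dataset, and the 1st/99th percentile limits are one of the rules mentioned in the thread, not a recommendation.

```stata
* Assumed variable names: roa (return on assets), industry (industry code)

* Industry-specific 1st and 99th percentiles of ROA
bysort industry: egen roa_p1  = pctile(roa), p(1)
bysort industry: egen roa_p99 = pctile(roa), p(99)

* Flag, rather than drop, observations outside those limits
* (the flag is missing whenever roa itself is missing)
generate byte extreme = (roa < roa_p1 | roa > roa_p99) if !missing(roa)

* How many observations are flagged in each industry?
tabulate industry extreme

* Industry summaries with and without the flagged observations
tabstat roa, by(industry) statistics(n mean median)
tabstat roa if extreme == 0, by(industry) statistics(n mean median)
```

Because nothing is dropped, any later command can be run twice, once on all observations and once with "if extreme == 0", which makes the effect of the exclusion easy to document.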

Kind regards
From: [] on behalf of Nick Cox []
Sent: 16 April 2013 13:21
Subject: Re: st: Removing outliers from my dataset

My input is that in my field dropping outliers like this would be
regarded as very poor practice. Dropping outliers that appear to be
unrecoverable measurement errors could be justified; dropping outliers
that are genuine extremes makes little sense. On these grounds you
would discard the Amazon because on most criteria it's the world's
biggest river (and not only that, really big).

Even if your aim is to focus on what's typical you shouldn't just
throw away 2% of the data because those fractions are awkward to
analyse. That's a reflection on the competence of the analyst, not a
concession to the world's complexities.

Genuine alternatives include

using something other than means to summarize (e.g. median, or something else)

transforming your problematic variables

non-identity link functions in -glm-

robust or quantile regression
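
As a rough illustration of two of these alternatives, here is a minimal Stata sketch. The variable names roa and industry are assumptions about the dataset, and x stands for a hypothetical predictor that is not part of the original question.

```stata
* Assumed variable names: roa, industry; x is a hypothetical predictor

* Industry median ROA: a summary that is much less sensitive
* to a few extreme values than the mean
bysort industry: egen roa_median = median(roa)

* Median (quantile) regression of ROA on x with industry indicators
qreg roa x i.industry

* Robust regression, which downweights observations with large residuals
rreg roa x i.industry
```

Both regressions keep every observation in the dataset; the extreme values simply have less influence on the fit than they would in ordinary least squares.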

A specific objection to outlier criteria based on mean and SD is that
outliers tend to inflate mean and especially SD, so you may not even
drop what you think is obviously deviant. Using median and median
deviation is not quite so vulnerable, but can be problematic in
different ways.
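
A small made-up example shows the problem. The five values below are invented for illustration, with one value echoing the 6500% ROA mentioned in the original posting. In a sample of n values no observation can lie more than (n - 1)/sqrt(n) sample SDs from the mean, which is about 1.79 for n = 5, so a 2 SD rule flags nothing here.

```stata
* Invented data: four ordinary ROA values (%) and one extreme value
clear
input roa
5
6
7
8
6500
end

* Distance from the mean in SD units
summarize roa
generate z = (roa - r(mean)) / r(sd)

* Distance from the median in units of the median absolute deviation
egen roa_med = median(roa)
generate absdev = abs(roa - roa_med)
egen roa_mad = median(absdev)
generate zmad = (roa - roa_med) / roa_mad

* |z| stays below 2 for every observation, including 6500,
* while zmad singles out the extreme value
list roa z zmad
```

The extreme value inflates the SD so much that it hides itself. The median-based measure does not have that problem in this example, although it can fail in other ways, for instance when more than half of the values are identical and the median deviation is zero.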

It's not difficult to work out how to relate values to mean and SD,
but I am all set on discouraging you from doing any such thing.


On 16 April 2013 12:07, Andreas Dall Frøseth
<> wrote:
> Dear Statalisters.
> I'm still working with my dataset containing company data and accounting numbers for a large number of companies in the period 1992 to 2010. I've figured out the calculations in my last posting, but I'm currently experiencing some difficulties when trying to analyse the effect of outliers in my dataset.
> For each company in the set, I have calculated ROA and sales growth each year. The problem, however, is that due to weaknesses in my dataset, some of the companies are now listed with unreasonable values of ROA. For example, one of the companies has a ROA of 6500%. I have made sure that this is not due to a miscalculation, but rather due to accounting rules and newly listed companies.
> I therefore wish to examine my dataset and, if possible, get rid of outliers. My main idea is to calculate a mean industry ROA (the set contains a variable "industry", with an industry indicator for each company, e.g. 4 = fishing), and to set a bottom and top limit for each industry. Does anyone have input on the most appropriate way to do this? There seem to be several different approaches I could apply, e.g. 2 standard deviations from the mean or the 1st and 99th percentiles.
> I would also appreciate input on how to do this in Stata: creating an industry mean, and a variable indicating which companies to exclude from my population based on a recommended approach.
> All feedback will be appreciated.
> Kind regards
> Andreas

