Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: Andreas Dall Frøseth <Andreas.Froseth@stud.nhh.no>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: SV: st: Removing outliers from my dataset
Date: Tue, 16 Apr 2013 14:11:35 +0000

Thank you for your feedback. I agree with the concerns you listed and will certainly keep them in mind while analyzing possible problems with my dataset.

However, because I am using accounting numbers as an approximation of the real economic nature of a company, I will run into some difficulties. Accounting numbers are highly affected by accounting principles, which might give certain companies an incentive to portray a false economic situation to the public. With this in mind, I have reviewed a large number of the companies that, in my opinion, have a ROA with no basis in economic sense, and found that this can only be due to faults in the accounting numbers.

My aim is to find what constitutes the best basis for competitive advantage in the companies in my dataset, defining competitive advantage as earning a ROA above the industry mean. If I used the industry median rather than the industry mean, I would depart from that definition, and my thesis would make no economic or theoretical sense. On the other hand, calculating an industry mean from numbers distorted by accounting principles might lead to the wrong conclusion.

This being the case, I would like a systematic approach to excluding extreme numbers from my dataset, rather than excluding them manually. I'm aware that this approach might exclude more than the companies I am concerned about, but given the limited hours available and a rather large dataset, I find it impractical to do this any other way.

Is there an approach I could use that allows me to analyze the effect of excluding some of the companies before I drop them from my set (i.e. creating a variable and using the -if- qualifier)? This would be helpful when documenting my work in my thesis.

Kind regards
Andreas

________________________________________
From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [njcoxstata@gmail.com]
Sent: 16 April 2013 13:21
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Removing outliers from my dataset

My input is that in my field dropping outliers like this would be regarded as very poor practice.

Dropping outliers that appear to be unrecoverable measurement errors could be justified; dropping outliers that are genuine extremes makes little sense. On these grounds you would discard the Amazon because on most criteria it's the world's biggest river (and not only that, really big).

Even if your aim is to focus on what's typical you shouldn't just throw away 2% of the data because those fractions are awkward to analyse. That's a reflection on the competence of the analyst, not a concession to the world's complexities.

Genuine alternatives include

using something other than means to summarize (e.g. median, or something else)

transforming your problematic variables

non-identity link functions in -glm-

robust or quantile regression

A specific objection to outlier criteria based on mean and SD is that outliers tend to inflate mean and especially SD, so you may not even drop what you think is obviously deviant. Using median and median deviation is not quite so vulnerable, but can be problematic in different ways.

It's not difficult to work out how to relate values to mean and SD, but I am all set on discouraging you from doing any such thing.

Nick
njcoxstata@gmail.com

On 16 April 2013 12:07, Andreas Dall Frøseth <Andreas.Froseth@stud.nhh.no> wrote:
> Dear statalisters.
>
> I'm still working with my dataset containing company data and accounting numbers for a large number of companies in the period 1992 to 2010. I've figured out the calculations in my last posting, but I'm currently experiencing some difficulties when trying to analyse the effect of outliers in my dataset.
>
> For each company in the set, I have calculated ROA and sales growth each year. The problem, however, is that due to weaknesses in my dataset, some of the companies are now listed with unreasonable values of ROA. For example, one of the companies has a ROA of 6500 %. I have assured myself that this is not due to a miscalculation, but rather to accounting rules and newly listed companies.
> I therefore wish to examine my dataset and, if possible, get rid of outliers. My main idea is to calculate a mean industry ROA (the set contains a variable "industry", with an industry indicator for each company, e.g. 4 = fishing), and to set a bottom and top limit for each industry. Does anyone have input on the most appropriate way to do this? There seem to be several different approaches I could apply, e.g. 2 standard deviations from the mean, or the 1st and 99th percentiles.
>
> I would also appreciate input on how to do this in STATA: creating an industry mean, and a variable indicating which companies to exclude from my population based on a recommended approach.
>
> All feedback will be appreciated.
>
> Kind regards
> Andreas
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
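The question raised at the top of this message, whether the effect of excluding companies can be examined before anything is dropped, could be approached along the following lines. This is only a sketch and not a method given in the thread: it assumes the data in memory hold a numeric variable `roa` (a hypothetical name) alongside the `industry` indicator mentioned in the original post, and it uses the within-industry 1st and 99th percentiles purely as an illustrative cut-off. The names `roa_p1`, `roa_p99` and `extreme` are likewise made up for the example.

```stata
* Assumes a dataset in memory with a numeric -roa- (hypothetical name)
* and the -industry- indicator mentioned in the post.

* Within-industry 1st and 99th percentiles of ROA
egen roa_p1  = pctile(roa), p(1)  by(industry)
egen roa_p99 = pctile(roa), p(99) by(industry)

* Indicator: 1 = outside the limits, 0 = inside.
* The -if- keeps it missing when ROA is missing, because Stata treats
* missing as larger than any number and would otherwise flag it as high.
generate byte extreme = (roa < roa_p1 | roa > roa_p99) if !missing(roa)

* How many observations would be excluded in each industry?
tabulate industry extreme

* Industry summaries with and without the flagged observations.
* Nothing is dropped, so the two sets of results can be compared.
tabstat roa, by(industry) statistics(n mean sd median)
tabstat roa if extreme == 0, by(industry) statistics(n mean sd median)
```

Because the flag is just a variable, any later command can be run twice, once on all observations and once with `if extreme == 0`, which makes the consequences of a given cut-off easy to document. The objections to discarding genuine extremes quoted above still apply to whatever rule defines the flag.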

**Follow-Ups**:**Re: st: Removing outliers from my dataset***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: Removing outliers from my dataset***From:*"JVerkuilen (Gmail)" <jvverkuilen@gmail.com>

**References**:**st: Removing outliers from my dataset***From:*Andreas Dall Frøseth <Andreas.Froseth@stud.nhh.no>

**Re: st: Removing outliers from my dataset***From:*Nick Cox <njcoxstata@gmail.com>
