Statalist The Stata Listserver



Re: st: Trimming data and statistics


From   "Akihito Tokuhara" <akitoku4@hotmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Trimming data and statistics
Date   Fri, 14 Apr 2006 09:38:25 -0500

Dear Maarten,

Thank you very much for your reply, which certainly makes the meaning of the median clearer to me.
In fact, when I test-trim the dataset, the median actually moves away from zero.

However, in light of your explanation, I would hesitate to use or interpret the new value of the
median. I would rather look more closely at the context, and probably impose further conditions
based on what the objective of the analysis demands, before recalculating the statistics. Many
thanks again.

Akihito

--------------------------------

From: Maarten buis <maartenbuis@yahoo.co.uk>
Reply-To: statalist@hsphsun2.harvard.edu
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Trimming data and statistics
Date: Fri, 14 Apr 2006 10:27:17 +0100 (BST)

Akihito:
Trimming can be useful when dealing with a statistic that is sensitive to outliers, e.g. the mean.
However, the median is already pretty robust against outliers. In your case zero is really the
50th percentile or middle observation. It doesn't matter whether the most negative value is
-.000001 or -10000000000000, the middle observation remains zero. This is, however, not true for the
mean. This is what we mean when we say that the median is robust against outliers and the mean isn't.
Cutting off extreme values can thus be useful for the mean but is useless for the median. This is
especially true if you cut off equal numbers of cases from the top and the bottom end of the
distribution, in which case the middle observation remains unchanged, and I cannot think of a
justification for cutting off more or fewer cases from the bottom end of the distribution than from
the top end.
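
The point above can be checked on a small made-up sample. The following is only a minimal sketch in Python (not Stata), using the standard `statistics` module; the toy values and the helper `trim` are hypothetical and are not taken from the dataset discussed in this thread.

```python
import statistics

# Hypothetical toy sample: mostly zeros, a few negative and positive values.
values = [-5.0, -1.0] + [0.0] * 7 + [2.0, 3.0]

# The same sample with the most negative value made far more extreme.
extreme = [-1e13] + values[1:]

def trim(data, k):
    """Drop the k smallest and the k largest observations (symmetric trimming)."""
    ordered = sorted(data)
    return ordered[k:len(ordered) - k]

# The median ignores how extreme the outlier is; the mean does not.
median_values = statistics.median(values)
median_extreme = statistics.median(extreme)
mean_values = statistics.mean(values)
mean_extreme = statistics.mean(extreme)

# Symmetric trimming removes the outlier's pull on the mean,
# but leaves the middle observation (the median) where it was.
trimmed = trim(extreme, 2)
median_trimmed = statistics.median(trimmed)
mean_trimmed = statistics.mean(trimmed)

print("median:", median_values, median_extreme, median_trimmed)
print("mean:  ", mean_values, mean_extreme, mean_trimmed)
```

In this sketch the median is the same in all three cases, while the mean changes greatly once the outlier is made extreme and returns close to the centre after trimming.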
HTH,
Maarten


--- Akihito Tokuhara <akitoku4@hotmail.com> wrote:
> I have a dataset of more than 100,000 obs, of which about 75,000 obs are
> zero, another 16,000 are missing, 1,500 are negative (which is not expected), and the
> remainder are positive.
>
> Because more than half of the sample is zero, the median is zero. I don't know
> if trimming the data would make sense to get the median different from zero.
> What I mean is: would that amount to "torturing" the data, or distorting it?


-----------------------------------------
Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands

visiting address:
Buitenveldertselaan 3 (Metropolitan), room Z214

+31 20 5986715

http://home.fsw.vu.nl/m.buis/
-----------------------------------------

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/




© Copyright 1996–2014 StataCorp LP