# st: Trimming data and statistics

From: "Akihito Tokuhara"
To: statalist@hsphsun2.harvard.edu
Subject: st: Trimming data and statistics
Date: Thu, 13 Apr 2006 22:09:22 -0500

Dear Statalist members,

I have a dataset of more than 100,000 obs, of which about 75,000 are zero,
another 16,000 are missing, 1,500 are negative (which is not expected), and the
remainder are positive.

Because more than half of the sample is zero, the median is zero. I don't know
whether it would make sense to trim the data in order to get a median different
from zero. What I mean is: would that be "torturing" the data, or distorting it?

My colleague suggests one way of trimming: replace all values less than the
5th percentile (p5) by p5, and all values larger than the 90th percentile (p90)
by p90. Thus the number of observations does not change.

My way is to cut off completely all obs having values less than p5 and all
those having values bigger than p90. This way will reduce the number of observations.
My argument is that the cut-off data are a sort of "outliers" which can be
ignored, although that might be wrong under different assumptions.
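
The two approaches described above (replacing values beyond the cut-offs, often called winsorizing, versus dropping them) can be compared in a minimal Python sketch. Everything here is an assumption made for illustration: the helper names, the nearest-rank percentile rule (other percentile definitions can give slightly different cut-offs), and the synthetic data, which only mimic the proportions in the post after missing values are dropped and assume exactly 100,000 obs in total.

```python
import statistics

def nearest_rank(sorted_values, p):
    """Nearest-rank percentile (integer p, 0 < p <= 100) of a sorted list."""
    n = len(sorted_values)
    rank = max((p * n + 99) // 100, 1)  # integer ceiling of p*n/100
    return sorted_values[rank - 1]

def winsorize(values, low_p=5, high_p=90):
    """Replace values beyond the cut-offs by the cut-offs; n is unchanged."""
    s = sorted(values)
    lo, hi = nearest_rank(s, low_p), nearest_rank(s, high_p)
    return [min(max(v, lo), hi) for v in values]

def trim(values, low_p=5, high_p=90):
    """Drop values strictly beyond the cut-offs; n shrinks."""
    s = sorted(values)
    lo, hi = nearest_rank(s, low_p), nearest_rank(s, high_p)
    return [v for v in values if lo <= v <= hi]

# Hypothetical non-missing data: 1,500 negatives, 75,000 zeros, 7,500 positives.
data = (
    [-float(i) for i in range(1, 1501)]
    + [0.0] * 75000
    + [float(i) for i in range(1, 7501)]
)

winsorized = winsorize(data)
trimmed = trim(data)

print(len(data), statistics.median(data))              # 84000 0.0
print(len(winsorized), statistics.median(winsorized))  # 84000 0.0
print(len(trimmed), statistics.median(trimmed))        # 75000 0.0
```

With these assumed proportions, zeros run from about the 2nd to the 91st percentile of the non-missing data, so p5 and p90 are both zero. Replacing then turns every value into zero, dropping keeps only the zeros, and the median stays at zero under either approach.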

Regarding the above, my questions are:

(Q-1) Is trimming the data a good way to do this?

(Q-2) If trimming is done, do the statistics lose their meaning?

In particular, suppose that in another dataset of similar size 75% of the obs
take one particular value a (a real number), which might not be very likely.
Then using the same trimming approach as above may result in a median
different from a.
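
The scenario above can be checked with a small Python sketch. The value of `a`, the sample size, the helper names, and the nearest-rank percentile rule are all assumptions chosen for illustration. The sketch keeps obs equal to a cut-off and drops only those strictly below p5 or strictly above p90, and it moves the remaining 25% of the data from entirely above `a` to entirely below `a`.

```python
import statistics

def nearest_rank(sorted_values, p):
    """Nearest-rank percentile (integer p, 0 < p <= 100) of a sorted list."""
    n = len(sorted_values)
    rank = max((p * n + 99) // 100, 1)  # integer ceiling of p*n/100
    return sorted_values[rank - 1]

def trim(values, low_p=5, high_p=90):
    """Drop values strictly below p5 or strictly above p90."""
    s = sorted(values)
    lo, hi = nearest_rank(s, low_p), nearest_rank(s, high_p)
    return [v for v in s if lo <= v <= hi]

a = 3.7  # hypothetical value taken by 75% of the obs

medians = []
for n_below in range(0, 251, 50):
    n_above = 250 - n_below
    below = [a - 1 - i for i in range(n_below)]  # distinct values below a
    above = [a + 1 + i for i in range(n_above)]  # distinct values above a
    data = below + [a] * 750 + above             # 1,000 obs, 75% equal to a
    medians.append(statistics.median(trim(data)))

print(all(m == a for m in medians))  # True
```

In every configuration tried here the median after trimming is still `a`. Under this trimming rule the block of obs equal to `a` always lies between p5 and p90, so it is never cut, and it still makes up more than half of what remains. A rule that cut a fixed share of obs regardless of ties could behave differently.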

Thanks in advance; I would very much appreciate any advice/criticism.
Perhaps this is a very basic question in statistics, so please forgive my ignorance.

Akihito

