
RE: st: highly skewed, highly zeroed data

From   "Lachenbruch, Peter" <>
To   <>
Subject   RE: st: highly skewed, highly zeroed data
Date   Wed, 25 Nov 2009 08:30:07 -0800

Nick's comments come close to what my thoughts were when I read the
initial post.  With 518 0s in the data, it looks as if something about
the process is causing a probability mass at 0.  In my original work on
this problem, I was motivated by cell growth on agar plates.  Some
plates had no growth, others had some, so I modeled using two parts:  a
test of the proportion with no growth and a rank-sum test for those that
showed growth, and then combined the two tests.  About 15 years later, a
student had a problem of modeling hospitalization costs in which 95% of
the people had no costs - so a model of 0 and model of non-zero worked
nicely.  In this case, the zeros were identifiable.  In some cases the
zeros are a mixture of structural zeros (can't be anything else) and
sampling zeros, so a mixture model like zip or zinb is needed.
There is a nice review issue of Statistical Methods in Medical Research
in 2002 on this topic that I edited.
My own bias is that the mean is not a good measure because of the heavy
fraction of zeros - no transformation will remove this clump.  I would
describe the data by the fraction of zeros and the mean or median of the
non-zeros.  A confidence interval can only mislead in this situation.
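
As a minimal sketch of the description suggested above (fraction of zeros, plus the mean or median of the non-zeros), the following Python uses only the standard library. The function name `two_part_summary` and the data are hypothetical, chosen only for illustration; they are not from the original post.

```python
from statistics import mean, median


def two_part_summary(values):
    """Describe zero-heavy data by the fraction of zeros and by the
    mean and median of the non-zero values only."""
    nonzero = [v for v in values if v != 0]
    n = len(values)
    return {
        "n": n,
        "frac_zero": (n - len(nonzero)) / n,
        # With no non-zero values there is nothing to summarize.
        "mean_nonzero": mean(nonzero) if nonzero else None,
        "median_nonzero": median(nonzero) if nonzero else None,
    }


# Hypothetical data: six zeros and four right-skewed positive values.
data = [0, 0, 0, 0, 0, 0, 1, 2, 4, 40]
summary = two_part_summary(data)
print(summary)
```

Reporting the two parts separately keeps the clump at zero visible instead of folding it into a single overall mean.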


Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
On Behalf Of Nick Cox
Sent: Wednesday, November 25, 2009 5:18 AM
Subject: RE: st: highly skewed, highly zeroed data

Contrary to the original signal, this is one of the most frequently
debated topics on this list, and for very good reason. What best to do
with highly skewed data that seem to cry out for a log transformation
except that in fact they include several zeros isn't obvious. A good
answer will depend not only on what the data are like but also on what
you know about the underlying process (which questioners typically do
not spell out) and on what exactly you are trying to do _and_ why (on
which there is on average a little more detail). Nudging the values to
all positive is also a fudge and a kludge, although sometimes it does
yield sensible results! 

I wish someone (else) would write a review paper or monograph about
this, but I don't know of one. 

In addition to other comments, here are two, utterly standard except
that they haven't appeared in this thread so far: 

1. The sample data are so skewed that it isn't obvious that any kind of
mean makes practical sense, although I agree strongly with Kieran McCaul
that you can do it directly and take comfort from the central limit
theorem.

2. Without process information, one wonders whether there is really a
two-step process: some do, some don't; and those that do have times from
a very skewed distribution. There are lots of paths to follow in that
direction, perhaps most simply some zero-inflated distribution, although
I imagine you'd have to program it yourself. Conversely, my instinct is
that a gamma distribution as suggested by Carlo Lazzaro does not look
quite right for that kind of distribution, unless it makes sense for the
positive values only. 
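
A rough sketch of the two-step process described in point 2: each observation is first either zero or positive, and the positive values then come from a skewed distribution. The lognormal choice, the parameter values, and the name `simulate_two_step` are assumptions made purely for illustration; nothing in the thread says the real data follow this model.

```python
import random
from statistics import mean


def simulate_two_step(n, p_any, mu, sigma, seed=0):
    """Step 1: with probability p_any the unit 'does' (positive value).
    Step 2: positive values are drawn from a lognormal (an assumed,
    illustrative choice of skewed distribution); otherwise the value is 0."""
    rng = random.Random(seed)
    return [
        rng.lognormvariate(mu, sigma) if rng.random() < p_any else 0.0
        for _ in range(n)
    ]


times = simulate_two_step(n=1000, p_any=0.5, mu=1.0, sigma=1.5)
positives = [t for t in times if t > 0]

share_positive = len(positives) / len(times)
mean_positive = mean(positives)

# The overall mean factors into the two parts:
# mean(all) = share_positive * mean(positives), since zeros add nothing.
overall_mean = mean(times)
print(share_positive, mean_positive, overall_mean)
```

The factorization in the last comment is why modeling the zero/non-zero split and the positive values separately loses no information about the overall mean.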


Carlo Lazzaro

Taking Maarten's wise remark forward, Jason (and whoever is interested
in this tricky topic) might want to take a look at an "old but gold" paper:
Manning WG, Mullahy J. Estimating Log Models: To Transform Or Not To
Transform? National Bureau Of Economic Research, Technical Working Paper
246, 1999 (downloadable with some restrictions from

Maarten Buis

--- On Wed, 25/11/09, Jason Ferris wrote:
> I am aware of adding a constant and the transforming on the
> log scale (with antilog) for interpretation.

The previous comments are useful and to the point. All I can
add is that this suggestion by the original poster will _not_
give you an estimate of the mean. Notice that the logarithm
is a non-linear transformation, so taking the logarithm of a
variable, computing a mean, and then back-transforming that mean
to the original metric will not give you the mean of the
original variable. If you didn't add the constant you would
have gotten the geometric mean, but by adding the constant you'll
just get a number with no clear interpretation.
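
The point can be checked numerically. The sketch below (standard-library Python; the helper name `backtransformed_log_mean` and the toy values are hypothetical) compares the arithmetic mean with the back-transformed mean of logs, with and without an added constant.

```python
import math
from statistics import mean


def backtransformed_log_mean(values, constant=0.0):
    """Compute exp(mean(log(x + constant))) - constant.
    With constant == 0 and all x > 0 this is the geometric mean."""
    logs = [math.log(v + constant) for v in values]
    return math.exp(mean(logs)) - constant


# Strictly positive toy data: the back-transformed log mean is the
# geometric mean (about 10 here), not the arithmetic mean (37).
positive = [1, 10, 100]
arithmetic = mean(positive)
geometric = backtransformed_log_mean(positive)

# Toy data with zeros: a constant must be added before taking logs,
# and the answer then depends on which constant was picked.
with_zeros = [0, 0, 1, 10, 100]
shifted_by_one = backtransformed_log_mean(with_zeros, constant=1.0)
shifted_by_half = backtransformed_log_mean(with_zeros, constant=0.5)

print(arithmetic, geometric)
print(mean(with_zeros), shifted_by_one, shifted_by_half)
```

In this toy example the two shifted results differ from each other and both fall well below the arithmetic mean of the data with zeros, so neither estimates that mean.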
