
From: "Lachenbruch, Peter" <Peter.Lachenbruch@oregonstate.edu>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: highly skewed, highly zeroed data
Date: Wed, 25 Nov 2009 08:30:07 -0800

Nick's comments come close to what my thoughts were when I read the initial post. With 518 0s in the data, it looks as if something about the process is causing a probability mass at 0. In my original work on this problem, I was motivated by cell growth on agar plates. Some plates had no growth, others had some, so I modeled using two parts: a test of the proportion with no growth and a rank sum test for those that showed growth, and I combined the two tests. About 15 years later, a student had a problem of modeling hospitalization costs in which 95% of the people had no costs - so a model of 0 and a model of non-zero worked nicely. In this case, the zeros were identifiable. In some cases the zeros are a mixture of structural zeros (can't be anything else) and sampling zeros, so a mixture model like zip or zinb is needed. There is a nice review issue of Statistical Methods in Medical Research in 2002 on this topic that I edited.

My own bias is that the mean is not a good measure because of the heavy fraction of zeros - no transformation will remove this clump. I would describe the data by the fraction of zeros and the mean or median of the non-zeros. A confidence interval can only mislead in this situation.

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
Sent: Wednesday, November 25, 2009 5:18 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: highly skewed, highly zeroed data

Contrary to the original signal, this is one of the most frequently debated topics on this list, and for very good reason. What best to do with highly skewed data that seem to cry out for a log transformation except that in fact they include several zeros isn't obvious.
A good answer will depend not only on what the data are like but also on what you know about the underlying process (which questioners typically do not spell out) and on what exactly you are trying to do _and_ why (on which there is on average a little more detail). Nudging the values to all positive is also a fudge and a kludge, although sometimes it does yield sensible results!

I wish someone (else) would write a review paper or monograph about this, but I don't know of one.

In addition to other comments, here are two, utterly standard except that they haven't appeared in this thread so far:

1. The sample data are so skewed that it isn't obvious that any kind of mean makes practical sense, although I agree strongly with Kieran McCaul that you can do it directly and take comfort from the central limit theorem.

2. Without process information, one wonders whether there is really a two-step process: some do, some don't; and those that do have times from a very skewed distribution. There are lots of paths to follow in that direction, perhaps most simply some zero-inflated distribution, although I imagine you'd have to program it yourself. Conversely, my instinct is that a gamma distribution as suggested by Carlo Lazzaro does not look quite right for that kind of distribution, unless it makes sense for the positive values only.

Nick
n.j.cox@durham.ac.uk

Carlo Lazzaro

Taking Maarten's wise remark forward, Jason (and whoever is interested in this tricky topic) might want to take a look at "old but gold": Manning WG, Mullahy J. Estimating Log Models: To Transform Or Not To Transform? National Bureau Of Economic Research, Technical Working Paper 246, 1999 (downloadable with some restrictions from http://www.nber.org/papers/T0246).

Maarten buis

--- On Wed, 25/11/09, Jason Ferris wrote:
> I am aware of adding a constant and then transforming on the
> log scale (with antilog) for interpretation.
The previous comments are useful and to the point. All I can add is that this suggestion by the original poster will _not_ give you an estimate of the mean. Notice that the logarithm is a non-linear transformation, so taking the logarithm of a variable, computing a mean, and then back-transforming that mean to the original metric will not give you the mean of the original variable. If you didn't add the constant you would have gotten the geometric mean, but by adding the constant you'll just get a meaningless number.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
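
Two points made in this thread can be checked numerically: Maarten's remark that back-transforming the mean of log(x + constant) does not recover the arithmetic mean, and Tony's suggestion to describe such data by the fraction of zeros plus the mean or median of the non-zeros. The following is only a minimal sketch in Python rather than Stata. The toy data and the function names `two_part_summary` and `backtransformed_log_mean` are assumptions made up for illustration and do not come from the thread.

```python
import math
import statistics


def two_part_summary(values):
    """Fraction of zeros, plus mean and median of the non-zero values."""
    nonzero = [v for v in values if v != 0]
    return {
        "fraction_zero": 1 - len(nonzero) / len(values),
        "mean_nonzero": statistics.mean(nonzero) if nonzero else None,
        "median_nonzero": statistics.median(nonzero) if nonzero else None,
    }


def backtransformed_log_mean(values, constant=0.0):
    """exp(mean(log(x + constant))) - constant.

    With constant=0 and strictly positive data this is the geometric mean.
    Every x + constant must be positive, or math.log raises ValueError.
    """
    logs = [math.log(v + constant) for v in values]
    return math.exp(statistics.mean(logs)) - constant


# Hypothetical toy data: mostly zeros plus a few skewed positive values.
data = [0] * 6 + [1, 10, 100, 1000]

arithmetic_mean = statistics.mean(data)
shift_one = backtransformed_log_mean(data, constant=1.0)
shift_half = backtransformed_log_mean(data, constant=0.5)
geometric_nonzero = backtransformed_log_mean([v for v in data if v > 0])
summary = two_part_summary(data)

print("arithmetic mean:             ", arithmetic_mean)
print("back-transformed, constant 1:", shift_one)
print("back-transformed, constant .5:", shift_half)
print("geometric mean of non-zeros: ", geometric_nonzero)
print("two-part summary:            ", summary)
```

On this toy data the back-transformed value is far below the arithmetic mean and changes when the added constant changes, which is one way to see why it is hard to interpret. The two-part summary does not depend on any arbitrary constant.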

**References**:

- **Re: st: highly skewed, highly zeroed data** *From:* Maarten buis <maartenbuis@yahoo.co.uk>
- **R: st: highly skewed, highly zeroed data** *From:* "Carlo Lazzaro" <carlo.lazzaro@tin.it>
- **RE: st: highly skewed, highly zeroed data** *From:* "Nick Cox" <n.j.cox@durham.ac.uk>

