
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: highly skewed, highly zeroed data
Date: Wed, 25 Nov 2009 15:43:22 -0000

I should perhaps add a very simple point but one that is sometimes overlooked. Even in situations with such a high skew that analysts might feel that a mean is dubious, it can still make sense because of its link to the total. Thus, these data are times of something in hours: the total time could still be a useful thing to know in terms of the total support required for patients, or whatever it is. A similar point presumably applies to e.g. cost data as discussed by Carlo Lazzaro.

Nick
n.j.cox@durham.ac.uk

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
Sent: 25 November 2009 13:18
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: highly skewed, highly zeroed data

Contrary to the original signal, this is one of the most frequently debated topics on this list, and for very good reason. What best to do with highly skewed data that seem to cry out for a log transformation, except that in fact they include several zeros, isn't obvious. A good answer will depend not only on what the data are like but also on what you know about the underlying process (which questioners typically do not spell out) and on what exactly you are trying to do _and_ why (on which there is on average a little more detail).

Nudging the values to all positive is also a fudge and a kludge, although sometimes it does yield sensible results!

I wish someone (else) would write a review paper or monograph about this, but I don't know of one.

In addition to other comments, here are two, utterly standard except that they haven't appeared in this thread so far:

1. The sample data are so skewed that it isn't obvious that any kind of mean makes practical sense, although I agree strongly with Kieran McCaul that you can do it directly and take comfort from the central limit theorem.

2. Without process information, one wonders whether there is really a two-step process: some do, some don't; and those that do have times from a very skewed distribution. There are lots of paths to follow in that direction, perhaps most simply some zero-inflated distribution, although I imagine you'd have to program it yourself. Conversely, my instinct is that a gamma distribution as suggested by Carlo Lazzaro does not look quite right for that kind of distribution, unless it makes sense for the positive values only.

Nick
n.j.cox@durham.ac.uk

Carlo Lazzaro

Taking Maarten's wise remark forward, Jason (and whoever is interested in this tricky topic) might want to take a look at "old but gold": Manning WG, Mullahy J. Estimating Log Models: To Transform Or Not To Transform? National Bureau Of Economic Research, Technical Working Paper 246, 1999 (downloadable with some restrictions from http://www.nber.org/papers/T0246).

Maarten buis

--- On Wed, 25/11/09, Jason Ferris wrote:
> I am aware of adding a constant and the transforming on the
> log scale (with antilog) for interpretation.

The previous comments are useful and to the point. All I can add is that this suggestion by the original poster will _not_ give you an estimate of the mean. Notice that the logarithm is a non-linear transformation, so taking the logarithm of a variable, computing a mean, and then back-transforming that mean to the original metric will not give you the mean of the original variable. If you didn't add the constant you would have gotten the geometric mean, but by adding the constant you'll just get a meaningless number.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
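
A small numerical illustration of the back-transformation point made in this thread may help. It is only a sketch in Python (not Stata, and not from any of the posters): the `hours` values and the constant `c` are made up for the example. It compares the arithmetic mean, the back-transformed mean of log(x + c), and the geometric mean of the positive values, and checks that the arithmetic mean times the sample size recovers the total.

```python
import math
from statistics import fmean, geometric_mean

# Hypothetical durations in hours: mostly zeros plus a few large values.
# These numbers are invented for illustration, not taken from the thread.
hours = [0, 0, 0, 0, 0, 0, 1, 2, 10, 100]

# Arbitrary constant added before taking logs (an assumption of this sketch).
c = 1.0

# Plain arithmetic mean of the original variable.
arithmetic_mean = fmean(hours)

# Mean on the log(x + c) scale, then back-transformed with exp() and minus c.
log_scale_mean = fmean(math.log(x + c) for x in hours)
backtransformed = math.exp(log_scale_mean) - c

# Geometric mean, which is only defined here for the strictly positive values.
positives = [x for x in hours if x > 0]
gm_positive = geometric_mean(positives)

# The arithmetic mean keeps its link to the total: mean * n equals the sum.
total_from_mean = arithmetic_mean * len(hours)

print(f"arithmetic mean:            {arithmetic_mean:.3f}")
print(f"back-transformed log mean:  {backtransformed:.3f}")
print(f"geometric mean (positives): {gm_positive:.3f}")
print(f"total from mean, actual:    {total_from_mean:.1f}, {sum(hours)}")

# The back-transformed value also moves with the arbitrary choice of constant.
for const in (0.01, 0.5, 1.0):
    value = math.exp(fmean(math.log(x + const) for x in hours)) - const
    print(f"c = {const}: back-transformed value = {value:.3f}")
```

With data like these the back-transformed value falls well below the arithmetic mean and shifts as the constant changes, which is consistent with the warning that it does not estimate the mean of the original variable.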

**References**:

- **Re: st: highly skewed, highly zeroed data** *From:* Maarten buis <maartenbuis@yahoo.co.uk>
- **R: st: highly skewed, highly zeroed data** *From:* "Carlo Lazzaro" <carlo.lazzaro@tin.it>
- **RE: st: highly skewed, highly zeroed data** *From:* "Nick Cox" <n.j.cox@durham.ac.uk>
