
From: "Carlo Lazzaro" <carlo.lazzaro@tin.it>
To: <statalist@hsphsun2.harvard.edu>
Subject: R: st: highly skewed, highly zeroed data
Date: Wed, 25 Nov 2009 15:28:56 +0100

<Conversely, my instinct is that a gamma distribution as suggested by Carlo Lazzaro does not look quite right for that kind of distribution, unless it makes sense for the positive values only>.

Admittedly, as Nick pointed out, Jason's dataset reports an impressive frequency of zero observations (518 out of 647). My previous suggestion of a gamma distribution comes from my experience in dealing with health care related costs, whose distribution is usually right-skewed.

For two interesting references, please see:
- Briggs A, Nixon R, Dixon S, Thompson S. Parametric modelling of cost data: some simulation evidence. Health Economics 2005; 14(4): 421-428 (freely downloadable at http://eprints.gla.ac.uk/4151/);
- Briggs A, Sculpher M, Claxton K. Decision Modelling for Health Economic Evaluation. Oxford: Oxford University Press, 2006: 77-120.

The main reasons for this behaviour are the following:
- some patients may die shortly after being enrolled in a given arm of a given clinical trial. Hence, they accrue zero costs;
- conversely, some patients may accrue a lot of cost due to, say, adverse effects of a given therapy that are expensive to treat.

However (and Nick's remarks highlight a very tricky and still unresolved issue), no one in our research field has ever issued guidance about the quantitative meaning of "substantial proportion of zero observations" (Briggs A, Clarke P, Polsky D, Glick H. Modelling the cost of health care interventions. Paper prepared for DEEM III: Costing Methods for Economic Evaluation. University of Aberdeen, 15-16th April 2003) in cost distributions, and I would assume that this task is close to impossible.

Finally, I do hope that I will never come across a data set like this in my next economic evaluation of health care programmes!!
Kind Regards,
Carlo

-----Original message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Nick Cox
Sent: Wednesday, 25 November 2009 14:18
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: highly skewed, highly zeroed data

Contrary to the original signal, this is one of the most frequently debated topics on this list, and for very good reason. What best to do with highly skewed data that seem to cry out for a log transformation, except that in fact they include several zeros, isn't obvious. A good answer will depend not only on what the data are like but also on what you know about the underlying process (which questioners typically do not spell out) and on what exactly you are trying to do _and_ why (on which there is on average a little more detail). Nudging the values to all positive is also a fudge and a kludge, although sometimes it does yield sensible results!

I wish someone (else) would write a review paper or monograph about this, but I don't know of one.

In addition to other comments, here are two, utterly standard except that they haven't appeared in this thread so far:

1. The sample data are so skewed that it isn't obvious that any kind of mean makes practical sense, although I agree strongly with Kieran McCaul that you can do it directly and take comfort from the central limit theorem.

2. Without process information, one wonders whether there is really a two-step process: some do, some don't; and those that do have times from a very skewed distribution. There are lots of paths to follow in that direction, perhaps most simply some zero-inflated distribution, although I imagine you'd have to program it yourself. Conversely, my instinct is that a gamma distribution as suggested by Carlo Lazzaro does not look quite right for that kind of distribution, unless it makes sense for the positive values only.
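The two-step idea in point 2 can be illustrated with a minimal Python sketch. It is not code from anyone in this thread: the function name `two_part_summary` is hypothetical, and the gamma shape and scale used to simulate the positive values are arbitrary assumptions. Only the counts (518 zeros out of 647) come from the thread. The sketch splits a sample into "some do, some don't" and "how much, for those that do", and checks that the overall mean is the share of positives times the mean of the positives.

```python
import random


def two_part_summary(values):
    """Return (share of positive values, mean of the positive values)."""
    positives = [v for v in values if v > 0]
    share_positive = len(positives) / len(values)
    mean_positive = sum(positives) / len(positives) if positives else 0.0
    return share_positive, mean_positive


rng = random.Random(2009)  # fixed seed so the sketch is reproducible

# 518 zeros out of 647 observations, as reported in the thread.
# The 129 positive values are simulated from a gamma distribution whose
# shape (0.5) and scale (40.0) are assumptions made for illustration only.
sample = [0.0] * 518 + [rng.gammavariate(0.5, 40.0) for _ in range(129)]

share, mean_pos = two_part_summary(sample)
overall_mean = sum(sample) / len(sample)

print(f"share of positive values: {share:.3f}")
print(f"mean of positive values:  {mean_pos:.2f}")
print(f"overall mean:             {overall_mean:.2f}")
print(f"share * positive mean:    {share * mean_pos:.2f}")
```

The last two printed values agree up to rounding, which is why a model for the zero/non-zero split plus a separate skewed-distribution model for the positive values can still recover the overall mean.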
Nick
n.j.cox@durham.ac.uk

Carlo Lazzaro

Taking Maarten's wise remark forward, Jason (and whoever is interested in this tricky topic) might want to take a look at an "old but gold" paper: Manning WG, Mullahy J. Estimating Log Models: To Transform Or Not To Transform? National Bureau Of Economic Research, Technical Working Paper 246, 1999 (downloadable with some restrictions from http://www.nber.org/papers/T0246).

Maarten Buis

--- On Wed, 25/11/09, Jason Ferris wrote:
> I am aware of adding a constant and then transforming on the
> log scale (with antilog) for interpretation.

The previous comments are useful and to the point; all I can add is that this suggestion by the original poster will _not_ give you an estimate of the mean. Notice that the logarithm is a non-linear transformation, so taking the logarithm of a variable, computing a mean, and then back-transforming that mean to the original metric will not give you the mean of the original variable. If you didn't add the constant you would get the geometric mean, but by adding the constant you'll just get a meaningless number.

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
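Maarten's point about back-transforming a mean of logs can be checked with a small Python sketch. The function names and the toy numbers are assumptions chosen for illustration, not data from the thread. Without an added constant the back-transformed value is the geometric mean; with a constant added to cope with zeros it matches neither the arithmetic nor the geometric mean.

```python
import math


def arithmetic_mean(values):
    """Ordinary mean on the original scale."""
    return sum(values) / len(values)


def back_transformed_log_mean(values, constant=0.0):
    """Compute exp(mean(log(x + constant))) - constant."""
    logs = [math.log(v + constant) for v in values]
    return math.exp(sum(logs) / len(logs)) - constant


# Toy positive data: the back-transformed log mean is the geometric mean.
positive = [1.0, 10.0, 100.0]
print(arithmetic_mean(positive))            # 37.0
print(back_transformed_log_mean(positive))  # about 10, the geometric mean

# Toy data with zeros: a constant (here 1) is needed before taking logs.
with_zeros = [0.0, 0.0, 0.0, 100.0]
print(arithmetic_mean(with_zeros))                 # 25.0
print(back_transformed_log_mean(with_zeros, 1.0))  # about 2.17
```

In the second example the result also changes with the choice of constant, which is one way to see why it has no clear interpretation as a mean.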

**References**:
- **RE: st: highly skewed, highly zeroed data**
  *From:* "Nick Cox" <n.j.cox@durham.ac.uk>


© Copyright 1996–2014 StataCorp LP