
From: Austin Nichols <austinnichols@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: highly skewed, highly zeroed data
Date: Wed, 25 Nov 2009 10:38:40 -0500

Jason Ferris et al. --

That the variable is named time suggests a kind of censoring that may need to be modeled; perhaps the poster can clarify the nature of the data. If this is hours spent volunteering last week, no modeling is required to describe the distribution. If it is months unemployed, describing the central tendency including zeros may no longer make sense (and if zeros represent several kinds of data, e.g. employed and not in labor force, you have various other problems to confront), but you also cannot simply summarize the positive values.

In general, transforming, calculating the mean and CI, and then back-transforming the mean/CI is not a good way to get measures of central tendency, but that general rule is made to be broken (though not by adding a constant before logging).

In this case, giving the overall mean, the proportion positive, and the mean of the positive values seems a natural starting point. You may prefer to report the median of positive values instead of, or in addition to, the mean (the mean and median of log positive values are quite close, but the mean of positive values is about 8 and the median about 4). Unfortunately, the median in survey data does not have a nice sampling distribution, but you can use the ugly kludge below to get an imperfect CI.

    clear
    * frequency table: value of time and number of cases
    input time wt
    0 518
    .25 2
    .5 3
    1 15
    1.5 1
    2 23
    3 10
    3.5 1
    4 11
    5 13
    6 9
    7 3
    8 19
    20 10
    45 9
    end
    g pos=time>0
    * one row per case
    expand wt
    drop wt
    * made-up sampling weights (1 to 10) and 25 made-up clusters
    g w=ceil(uniform()*10)
    g c=mod(_n,25)
    svyset c [pw=w]
    svy:mean time
    svy:proportion pos
    svy,subpop(pos):mean time
    * wrapper: weighted median regression of the variable passed as `1'
    cap prog drop mysvymed
    prog mysvymed
    qreg `1' [aw=w]
    end
    preserve
    keep if time>0
    * bootstrap the median of positive values, resampling clusters
    bs, cl(c): mysvymed time
    restore

On Wed, Nov 25, 2009 at 8:17 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> Contrary to the original signal, this is one of the most frequently
> debated topics on this list, and for very good reason. What best to do
> with highly skewed data that seem to cry out for a log transformation
> except that in fact they include several zeros isn't obvious. A good
> answer will depend not only on what the data are like but also on what
> you know about the underlying process (which questioners typically do
> not spell out) and on what exactly you are trying to do _and_ why (on
> which there is on average a little more detail). Nudging the values to
> all positive is also a fudge and a kludge, although sometimes it does
> yield sensible results!
>
> I wish someone (else) would write a review paper or monograph about
> this, but I don't know of one.
>
> In addition to other comments, here are two, utterly standard except
> that they haven't appeared in this thread so far:
>
> 1. The sample data are so skewed that it isn't obvious that any kind of
> mean makes practical sense, although I agree strongly with Kieran McCaul
> that you can do it directly and take comfort from the central limit
> theorem.
>
> 2. Without process information, one wonders whether there is really a
> two-step process: some do, some don't; and those that do have times from
> a very skewed distribution. There are lots of paths to follow in that
> direction, perhaps most simply some zero-inflated distribution, although
> I imagine you'd have to program it yourself. Conversely, my instinct is
> that a gamma distribution as suggested by Carlo Lazzaro does not look
> quite right for that kind of distribution, unless it makes sense for the
> positive values only.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Carlo Lazzaro
>
> Taking Maarten's wise remark forward, Jason (and whoever is interested
> in this tricky topic) might want to take a look at "old but gold":
> Manning WG, Mullahy J. Estimating Log Models: To Transform Or Not To
> Transform? National Bureau Of Economic Research, Technical Working Paper
> 246, 1999 (downloadable with some restrictions from
> http://www.nber.org/papers/T0246).
>
> Maarten buis
>
> --- On Wed, 25/11/09, Jason Ferris wrote:
>> I am aware of adding a constant and then transforming on the
>> log scale (with antilog) for interpretation.
>
> The previous comments are useful and to the point; all I can
> add is that this suggestion by the original poster will _not_
> give you an estimate of the mean. Notice that the logarithm
> is a non-linear transformation, so taking the logarithm of a
> variable, computing a mean, and then back-transforming that mean
> to the original metric will not give you the mean of the
> original variable. If you didn't add the constant you would
> have gotten the geometric mean, but by adding the constant you'll
> just get a meaningless number.
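The point about back-transformed means can be checked directly on the frequency table posted in this thread. What follows is a minimal sketch, not code from any of the posters: the variable name lt and the choice of 1 as the added constant are assumptions made purely for illustration, and no survey weights or clusters are used.

```stata
* Illustrative sketch only: lt and the constant 1 are assumed, not from the thread
clear
input time wt
0 518
.25 2
.5 3
1 15
1.5 1
2 23
3 10
3.5 1
4 11
5 13
6 9
7 3
8 19
20 10
45 9
end
* one row per case
expand wt
drop wt

* arithmetic mean of time, zeros included
quietly summarize time
display "mean of time:              " r(mean)

* mean of ln(time + 1), back-transformed to the original scale
generate lt = ln(time + 1)
quietly summarize lt
display "exp(mean(ln(time+1))) - 1: " exp(r(mean)) - 1

* mean and median of the positive values only
quietly summarize time if time > 0, detail
display "mean of positive values:   " r(mean)
display "median of positive values: " r(p50)
```

On these data the back-transformed figure should come out at about 0.4, against an arithmetic mean of about 1.6, and a different added constant would give a different number again. The last two lines should reproduce the mean of about 8 and median of about 4 quoted above for the positive values.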

