# Re: st: highly skewed, highly zeroed data

From: Austin Nichols
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: highly skewed, highly zeroed data
Date: Wed, 25 Nov 2009 10:38:40 -0500

```
Jason Ferris et al. --
That the variable is named time suggests a kind of censoring that may
need to be modeled; perhaps the poster can clarify the nature of the
data.  If this is hours spent volunteering last week, no modeling is
required to describe the distribution; if it is months unemployed,
describing the central tendency including zeros may no longer make
sense (and if zeros represent several kinds of data, e.g. employed and
not in labor force, you have various other problems to confront), but
you also cannot simply summarize the positive values. In general,
transforming, calculating the mean and CI, and back-transforming the mean/CI
is not a good way to get measures of the central tendency, but that
general rule is made to be broken (but not by adding a constant
before logging).  In this case, giving the overall mean, the
proportion positive, and the mean of the positive values seems a
natural starting point.  You may prefer to report the median of
positive values instead of, or in addition to, the mean (the mean and
median of log positive values are quite close).  The median in survey
data does not have a nice sampling distribution, but you can use the
ugly kludge below to get an imperfect CI.

clear
* time = reported value; wt = frequency of that value (expanded below)
input time wt
0 518
.25 2
.5 3
1 15
1.5 1
2 23
3 10
3.5 1
4 11
5 13
6 9
7 3
8 19
20 10
45 9
end
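* indicator for a positive time
gen pos=time>0
* one row per observation, using the frequencies in wt
expand wt
drop wt
* made-up survey design for illustration: random weights 1-10, 25 clusters
* (the seed is arbitrary; it only makes the random weights reproducible)
set seed 12345
gen w=ceil(runiform()*10)
gen c=mod(_n,25)
svyset c [pw=w]
* overall mean, proportion positive, and mean of the positive values
svy:mean time
svy:proportion pos
svy,subpop(pos):mean time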

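* wrapper so that bootstrap can call a weighted median regression
cap prog drop mysvymed
prog mysvymed
* a constant-only qreg estimates the weighted median of the named variable
qreg `1' [aw=w]
end
preserve
keep if time>0
* cluster bootstrap of the median of the positive values
bs, cl(c): mysvymed time
restore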

On Wed, Nov 25, 2009 at 8:17 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> Contrary to the original signal, this is one of the most frequently
> debated topics on this list, and for very good reason. What best to do
> with highly skewed data that seem to cry out for a log transformation
> except that in fact they include several zeros isn't obvious. A good
> answer will depend not only on what the data are like but also on what
> you know about the underlying process (which questioners typically do
> not spell out) and on what exactly you are trying to do _and_ why (on
> which there is on average a little more detail). Nudging the values to
> all positive is also a fudge and a kludge, although sometimes it does
> yield sensible results!
>
> I wish someone (else) would write a review paper or monograph about
> this, but I don't know of one.
>
> In addition to other comments, here are two, utterly standard except
> that they haven't appeared in this thread so far:
>
> 1. The sample data are so skewed that it isn't obvious that any kind of
> mean makes practical sense, although I agree strongly with Kieran McCaul
> that you can do it directly and take comfort from the central limit
> theorem.
>
> 2. Without process information, one wonders whether there is really a
> two-step process: some do, some don't; and those that do have times from
> a very skewed distribution. There are lots of paths to follow in that
> direction, perhaps most simply some zero-inflated distribution, although
> I imagine you'd have to program it yourself. Conversely, my instinct is
> that a gamma distribution as suggested by Carlo Lazzaro does not look
> quite right for that kind of distribution, unless it makes sense for the
> positive values only.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Carlo Lazzaro
>
> Taking Maarten's wise remark forward, Jason (and whoever is interested
> in this tricky topic) might want to take a look at "old but gold":
> Manning WG, Mullahy J. Estimating Log Models: To Transform Or Not To
> Transform? National Bureau Of Economic Research, Technical Working Paper
> (http://www.nber.org/papers/T0246).
>
> Maarten Buis
>
> --- On Wed, 25/11/09, Jason Ferris wrote:
>> I am aware of adding a constant and the transforming on the
>> log scale (with antilog) for interpretation.
>
> The previous comments are useful and to the point; all I can
> add is that this suggestion by the original poster will _not_
> give you an estimate of the mean. Notice that the logarithm
> is a non-linear transformation, so taking a logarithm of a
> variable, computing a mean, and then back-transforming that mean
> to the original metric will not give you the mean of the
> original variable. If you didn't add the constant you would
> have gotten the geometric mean, but by adding the constant you'll
> just get a meaningless number.
```
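
Two points from the thread can be checked directly on the posted frequency table: the suggested starting point (overall mean, proportion positive, mean of the positive values), and Maarten Buis's remark that back-transforming a mean of logs gives the geometric mean rather than the mean. What follows is a minimal sketch, not the method from the post. It assumes the table is treated as a simple unweighted sample via frequency weights, in place of the simulated survey design, and the names `pos`, `lnt`, `ln1`, `m_all`, `p_pos`, `m_pos`, `g_pos` and `shifted` are illustrative only.

```stata
* Sketch only: unweighted summaries of the posted frequency table.
* Assumption: fweights stand in for the simulated survey design.
clear
input time wt
0 518
.25 2
.5 3
1 15
1.5 1
2 23
3 10
3.5 1
4 11
5 13
6 9
7 3
8 19
20 10
45 9
end
generate byte pos = time > 0

* overall mean, share positive, and mean of the positive values
summarize time [fweight=wt], meanonly
scalar m_all = r(mean)
summarize pos [fweight=wt], meanonly
scalar p_pos = r(mean)
summarize time if pos [fweight=wt], meanonly
scalar m_pos = r(mean)
display "overall mean         = " m_all
display "share positive       = " p_pos
display "mean of positives    = " m_pos
* the overall mean equals share positive times mean of positives
display "share * mean of pos. = " p_pos*m_pos

* exp(mean of logs) is the geometric mean of the positive values
generate lnt = ln(time) if pos
summarize lnt [fweight=wt], meanonly
scalar g_pos = exp(r(mean))
display "geometric mean, pos. = " g_pos

* adding a constant before logging: the back-transformed value is
* neither the arithmetic nor the geometric mean
generate ln1 = ln(time + 1)
summarize ln1 [fweight=wt], meanonly
scalar shifted = exp(r(mean)) - 1
display "exp(mean(ln(t+1)))-1 = " shifted
```

Ignoring the survey weights, the table has 647 observations, of which 129 are positive, and the times sum to 1039, so the overall mean is 1039/647 (about 1.61) and the mean of the positive values is 1039/129 (about 8.05). The geometric mean can never exceed the arithmetic mean, which is why the back-transformed log mean understates the mean of a right-skewed variable.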