Statalist



RE: st: highly skewed, highly zeroed data


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: highly skewed, highly zeroed data
Date   Wed, 25 Nov 2009 15:43:22 -0000

I should perhaps add a very simple point, but one that is sometimes
overlooked. Even when the skew is so high that analysts might feel
that a mean is dubious, the mean can still make sense because of its
link to the total. Here, the data are times of something in hours:
the total time could still be a useful thing to know, in terms of the
total support required for patients, or whatever it is. A similar
point presumably applies to e.g. cost data, as discussed by Carlo
Lazzaro.
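
In Stata terms, that link is immediate. A minimal sketch, assuming a
hypothetical variable -hours-:

        * for unweighted data the estimated total is just N times
        * the sample mean, so whenever the total is meaningful,
        * so is the mean
        mean hours
        total hours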

Nick 
[email protected] 


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Nick Cox
Sent: 25 November 2009 13:18
To: [email protected]
Subject: RE: st: highly skewed, highly zeroed data

Contrary to the original signal, this is one of the most frequently
debated topics on this list, and for very good reason. What is best to
do with highly skewed data that seem to cry out for a log
transformation, except that in fact they include several zeros, is not
obvious. A good answer will depend not only on what the data are like,
but also on what you know about the underlying process (which
questioners typically do not spell out) and on what exactly you are
trying to do _and_ why (on which there is, on average, a little more
detail). Nudging the values so that all are positive is a fudge and a
kludge, although sometimes it does yield sensible results! 
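
For concreteness, the fudge in question is typically something like
this minimal sketch, with a hypothetical variable -y- and an
arbitrary constant c:

        * nudge the zeros before logging; results can be sensitive
        * to the (arbitrary) choice of c, which is the kludge
        local c = 1
        generate ln_y = ln(y + `c')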

I wish someone (else) would write a review paper or monograph about
this, but I don't know of one. 

In addition to other comments, here are two, utterly standard except
that they haven't appeared in this thread so far: 

1. The sample data are so skewed that it isn't obvious that any kind of
mean makes practical sense, although I agree strongly with Kieran McCaul
that you can do it directly and take comfort from the central limit
theorem (see the sketch after point 2). 

2. Without process information, one wonders whether there is really a
two-step process: some do, some don't; and those that do have times from
a very skewed distribution. There are lots of paths to follow in that
direction, perhaps most simply some zero-inflated distribution, although
I imagine you'd have to program it yourself. Conversely, my instinct is
that a gamma distribution as suggested by Carlo Lazzaro does not look
quite right for that kind of distribution, unless it makes sense for the
positive values only. 
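
Both points in miniature, as a hedged Stata sketch; -y-, -x1-, and
-x2- are hypothetical names:

        * point 1: estimate the mean directly, with a CLT-based
        * confidence interval
        mean y

        * point 2: a simple two-part model -- a logit for zero
        * versus positive, then a gamma GLM with log link fitted
        * to the positive values only
        generate byte any = (y > 0) if !missing(y)
        logit any x1 x2
        glm y x1 x2 if y > 0, family(gamma) link(log)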

Nick 
[email protected] 

Carlo Lazzaro

Taking Maarten's wise remark forward, Jason (and whoever is interested
in this tricky topic) might want to take a look at an "old but gold"
reference:

Manning WG, Mullahy J. Estimating log models: to transform or not to
transform? National Bureau of Economic Research, Technical Working
Paper 246, 1999 (downloadable with some restrictions from
http://www.nber.org/papers/T0246).
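
In Stata terms, the choice that paper weighs can be sketched roughly
as follows; -y-, -x1-, and -x2- are hypothetical names, and the gamma
family is just one candidate:

        * OLS on ln(y) estimates E(ln y | x) and needs a careful
        * retransformation to recover E(y | x); a GLM with log link
        * models E(y | x) directly (positives only in both cases)
        generate ln_ypos = ln(y) if y > 0
        regress ln_ypos x1 x2
        glm y x1 x2 if y > 0, family(gamma) link(log)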

Maarten Buis

--- On Wed, 25/11/09, Jason Ferris wrote:
> I am aware of adding a constant and then transforming on the
> log scale (with antilog) for interpretation.

The previous comments are useful and to the point; all I can 
add is that this suggestion by the original poster will _not_ 
give you an estimate of the mean. Notice that the logarithm 
is a non-linear transformation, so taking the logarithm of a 
variable, computing a mean, and then back-transforming that 
mean to the original metric will not give you the mean of the 
original variable. If you hadn't added the constant you would 
have gotten the geometric mean, but by adding the constant 
you'll just get a meaningless number.
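
A quick way to see this in Stata, with a hypothetical variable -y-
that is strictly positive:

        * exp(mean of ln y) is the geometric mean, not the
        * arithmetic mean; -ameans- reports both for comparison
        generate lny = ln(y)
        summarize lny, meanonly
        display "geometric mean via logs: " exp(r(mean))
        ameans y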

