Statalist



R: st: highly skewed, highly zeroed data


From   "Carlo Lazzaro" <carlo.lazzaro@tin.it>
To   <statalist@hsphsun2.harvard.edu>
Subject   R: st: highly skewed, highly zeroed data
Date   Wed, 25 Nov 2009 15:28:56 +0100

<Conversely, my instinct is that a gamma distribution as suggested by Carlo
Lazzaro does not look quite right for that kind of distribution, unless it
makes sense for the positive values only>.

Admittedly, as Nick pointed out, Jason's dataset shows a striking
frequency of zero observations (518 out of 647).
My previous suggestion of a gamma distribution comes from my experience in
dealing with health-care-related costs, whose distribution is usually
right-skewed. The main reasons for this behaviour are the following (for
two interesting references, please see:
- Briggs A, Nixon R, Dixon S, Thompson S. Parametric modelling of cost
data: some simulation evidence. Health Economics 2005; 14(4): 421-428;
freely downloadable at http://eprints.gla.ac.uk/4151/;
- Briggs A, Sculpher M, Claxton K. Decision Modelling for Health Economic
Evaluation. Oxford: Oxford University Press, 2006: 77-120):

- some patients may die shortly after being enrolled in a given arm of a
given clinical trial, and hence accrue zero costs;
- on the contrary, some patients may accrue very high costs due to, say,
adverse effects of a given therapy that are expensive to treat.
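To make this concrete, here is a minimal Python sketch (not Stata) of such a cost distribution. The gamma parameters and the zero proportion are invented for illustration, the latter echoing the 518/647 figure above:

```python
import numpy as np

# Hypothetical simulation of trial cost data: a point mass at zero
# (patients who accrue no cost) mixed with a right-skewed gamma
# distribution for patients who do. Parameters are illustrative,
# not taken from any real study.
rng = np.random.default_rng(42)

n = 647                 # sample size echoing Jason's data set
p_zero = 518 / 647      # observed proportion of zero costs
is_zero = rng.random(n) < p_zero
costs = np.where(is_zero, 0.0, rng.gamma(shape=0.8, scale=5000.0, size=n))

# The result is heavily right-skewed: the mean sits well above the
# median, which is the typical picture for health care cost data.
print(costs.mean(), np.median(costs))
```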

However (and Nick's remark highlights a very tricky, still unresolved
issue), no one in our research field has ever issued guidance on the
quantitative meaning of "substantial proportion of zero observations" in
cost distributions (Briggs A, Clarke P, Polsky D, Glick H. Modelling the
cost of health care interventions. Paper prepared for DEEM III: Costing
Methods for Economic Evaluation. University of Aberdeen, 15-16 April
2003), and I would assume that this task is quite impossible to
accomplish.

Finally, I do hope that I will never come across a data set like this in
my next economic evaluation of health care programmes!

Kind Regards,
Carlo
-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Nick Cox
Sent: Wednesday, 25 November 2009 14:18
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: highly skewed, highly zeroed data

Contrary to the original signal, this is one of the most frequently
debated topics on this list, and for very good reason. What best to do
with highly skewed data that seem to cry out for a log transformation
except that in fact they include several zeros isn't obvious. A good
answer will depend not only on what the data are like but also on what
you know about the underlying process (which questioners typically do
not spell out) and on what exactly you are trying to do _and_ why (on
which there is on average a little more detail). Nudging the values to
all positive is also a fudge and a kludge, although sometimes it does
yield sensible results! 

I wish someone (else) would write a review paper or monograph about
this, but I don't know of one. 

In addition to other comments, here are two, utterly standard except
that they haven't appeared in this thread so far: 

1. The sample data are so skewed that it isn't obvious that any kind of
mean makes practical sense, although I agree strongly with Kieran McCaul
that you can do it directly and take comfort from the central limit
theorem. 
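As a quick numerical illustration of that comfort from the central limit theorem, the following Python sketch (with an invented zero-inflated gamma mixture standing in for the data) shows that sample means of even a severely skewed, zero-heavy variable cluster tightly around the true mean:

```python
import numpy as np

# Illustration of the appeal to the central limit theorem: even for a
# severely skewed, zero-heavy variable, sample means behave well.
# The zero-inflated gamma mixture below is invented for illustration.
rng = np.random.default_rng(1)

def draw_sample(n):
    """One sample: roughly 80% zeros, gamma-distributed positives."""
    zeros = rng.random(n) < 0.8
    return np.where(zeros, 0.0, rng.gamma(0.8, 5000.0, size=n))

true_mean = 0.2 * 0.8 * 5000.0   # E[X] = P(X > 0) * E[gamma] = 800
means = np.array([draw_sample(647).mean() for _ in range(2000)])

# The sample means cluster around the true mean, and their
# distribution is far more symmetric than the raw data.
print(round(means.mean(), 1), true_mean)
```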

2. Without process information, one wonders whether there is really a
two-step process: some do, some don't; and those that do have times from
a very skewed distribution. There are lots of paths to follow in that
direction, perhaps most simply some zero-inflated distribution, although
I imagine you'd have to program it yourself. Conversely, my instinct is
that a gamma distribution as suggested by Carlo Lazzaro does not look
quite right for that kind of distribution, unless it makes sense for the
positive values only. 
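That two-step view can be sketched in a few lines of Python on simulated data (the actual modelling step, say a logit for the zeros plus a gamma model for the positives, is left out): the overall mean factors exactly into a participation part and a magnitude part:

```python
import numpy as np

# A sketch of the two-step view: whether any positive value occurs at
# all, and how large it is when it does. Data are simulated with
# invented parameters, not taken from the thread.
rng = np.random.default_rng(7)

zeros = rng.random(647) < 518 / 647
x = np.where(zeros, 0.0, rng.gamma(0.8, 5000.0, size=647))

p_pos = (x > 0).mean()        # part 1: probability of a positive value
mean_pos = x[x > 0].mean()    # part 2: mean of the positive values

# The overall mean factors exactly as P(X > 0) * E[X | X > 0], which
# is what a two-part (or hurdle) model estimates piece by piece.
print(p_pos * mean_pos, x.mean())
```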

Nick 
n.j.cox@durham.ac.uk 

Carlo Lazzaro

Taking Maarten's wise remark forward, Jason (and whoever is interested in
this tricky topic) might want to take a look at an "old but gold" paper:
Manning WG, Mullahy J. Estimating Log Models: To Transform Or Not To
Transform? National Bureau Of Economic Research, Technical Working Paper
246, 1999 (downloadable with some restrictions from
http://www.nber.org/papers/T0246).

Maarten buis

--- On Wed, 25/11/09, Jason Ferris wrote:
> I am aware of adding a constant and the transforming on the
> log scale (with antilog) for interpretation.

The previous comments are useful and to the point; all I can
add is that this suggestion by the original poster will _not_
give you an estimate of the mean. Notice that the logarithm
is a non-linear transformation, so taking the logarithm of a
variable, computing a mean, and then back-transforming that mean
to the original metric will not give you the mean of the
original variable. If you hadn't added the constant you would
have obtained the geometric mean, but by adding the constant you'll
just get a meaningless number.
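A small Python check of this point, using simulated strictly positive skewed data (parameters invented for illustration):

```python
import numpy as np

# Back-transforming the mean of log(x + c) does not recover the
# arithmetic mean. With c = 0 it yields the geometric mean; with
# c = 1 it yields a number that is neither. The data are simulated
# strictly positive right-skewed values.
rng = np.random.default_rng(3)
x = rng.gamma(0.8, 5000.0, size=1000)

geo_mean = np.exp(np.log(x).mean())        # c = 0: geometric mean
back = np.exp(np.log(x + 1).mean()) - 1    # "add 1, log, mean, antilog"

# Both sit well below the arithmetic mean of right-skewed data.
print(round(geo_mean), round(back), round(x.mean()))
```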

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


