Carlo Lazzaro

statalist@hsphsun2.harvard.edu

st: R: highly skewed, highly zeroed data

Wed, 25 Nov 2009 09:21:56 +0100

As an alternative to Kieran's hint, and given the positive skewness of his data, Jason may find it useful to calculate the desired 95% CI by fitting a Gamma distribution to the estimated mean and its standard error (method of moments) and drawing 10,000 random values from it. For two interesting references, please see:

Briggs, A., Nixon, R., Dixon, S. and Thompson, S. (2005). Parametric modelling of cost data: some simulation evidence. Health Economics 14(4): 421-428; freely downloadable at http://eprints.gla.ac.uk/4151/

Briggs, A., Sculpher, M. and Claxton, K. Decision Modelling for Health Economic Evaluation. Oxford: Oxford University Press, 2006: 77-120.

............................begin example.................................
input time wt
0    518
.25    2
.5     3
1     15
1.5    1
2     23
3     10
3.5    1
4     11
5     13
6      9
7      3
8     19
20    10
45     9
end

mean time [fweight = wt]

Mean estimation                     Number of obs    =     647

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        time |   1.605873   .2343624      1.145669    2.066077
--------------------------------------------------------------

set obs 10000

g Gamma=(.2343624^2/1.605873)*invgammap((1.605873/.2343624)^2, uniform())

sum Gamma

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       Gamma |     10000    1.605746    .2343959   .8457972   2.601775

centile Gamma, centile(2.5 97.5)

                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
       Gamma |   10000         2.5     1.177285        1.170511    1.187588
             |                97.5      2.09881        2.083514    2.114182
............................end example....................................

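
For readers wondering where the two constants in the `g Gamma=` line come from, the derivation below is a sketch of the method-of-moments matching that the expression appears to implement. The symbols k (shape), θ (scale), m (estimated mean) and s (its standard error) are my own notation, not taken from the post. Because s is the standard error of the mean rather than the standard deviation of the data, the simulated values approximate the sampling distribution of the mean, not the distribution of the raw observations.

```latex
% Gamma(k, \theta) with shape k and scale \theta has
%   E[X] = k\theta,  Var[X] = k\theta^2.
% Match these to the estimated mean m and its standard error s:
\begin{align*}
k\theta &= m, & k\theta^{2} &= s^{2} \\
\Rightarrow\quad k &= \left(\frac{m}{s}\right)^{2}, & \theta &= \frac{s^{2}}{m}.
\end{align*}
% With m = 1.605873 and s = 0.2343624 from the -mean- output:
\begin{align*}
k &= \left(\frac{1.605873}{0.2343624}\right)^{2} \approx 46.95, &
\theta &= \frac{0.2343624^{2}}{1.605873} \approx 0.0342.
\end{align*}
% Inverse-transform sampling: if U \sim \mathrm{Uniform}(0,1) and
% F_k^{-1} is the quantile function of Gamma(k, 1), then
\[
X = \theta \, F_k^{-1}(U) \sim \mathrm{Gamma}(k, \theta).
\]
```

In the Stata expression, `invgammap(k, u)` plays the role of the unit-scale quantile function and `uniform()` supplies U, so multiplying by θ rescales each draw to the fitted distribution. As a check, kθ ≈ 46.95 × 0.0342 ≈ 1.606, which agrees with the simulated mean reported by `sum Gamma`.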
HTH and Kind Regards,
Carlo

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Jason Ferris
Sent: Wednesday, 25 November 2009 3:07
To: statalist@hsphsun2.harvard.edu
Subject: st: highly skewed, highly zeroed data

Hi,

I have tried to find my answer in the statalist repository but nothing has quite hit the mark. I would like to calculate a mean and 95% CI for these data, which are highly skewed and mostly zeros. I am aware of adding a constant and then transforming to the log scale (with the antilog for interpretation). However, after adding a constant to overcome the zero issue and then transforming to the log scale, I am still left with a highly skewed distribution, which gets me no closer to a mean and CI.

PS. As this is survey data I would be most keen for the 'right' answer to be addressed in svy: terms

Jason

 time (hrs) |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        518       80.06       80.06
        .25 |          2        0.31       80.37
         .5 |          3        0.46       80.83
          1 |         15        2.32       83.15
        1.5 |          1        0.15       83.31
          2 |         23        3.55       86.86
          3 |         10        1.55       88.41
        3.5 |          1        0.15       88.56
          4 |         11        1.70       90.26
          5 |         13        2.01       92.27
          6 |          9        1.39       93.66
          7 |          3        0.46       94.13
          8 |         19        2.94       97.06
         20 |         10        1.55       98.61
         45 |          9        1.39      100.00
------------+-----------------------------------

