# st: R: ttest and log transformation

 From Carlo Lazzaro To statalist@hsphsun2.harvard.edu Subject st: R: ttest and log transformation Date Sat, 27 Sep 2008 13:28:28 +0200

```Dear Rich,
steps:

- take a look at the resulting sampling distribution;
- perform a bootstrap ttest;
- calculate how many times t_bootstrap is >= |t_original| or <= -|t_original|;
- contrast the obtained bootstrap p-value with the original one.
---------------------------begin example-----------------------------------
set obs 100
g A=10*(uniform())
g B=15*(uniform())
swilk A B //  Prob>z_A=0.00030; Prob>z_B=0.00032 // Both A and B are not normal
ttest A == B, unpaired unequal  //t =  -5.6293 and Pr(|T| > |t|) = 0.0000
return list
scalar t=r(t)
// shift both variables to a common mean, so that the null hypothesis holds
summarize A, mean
replace A=A-r(mean) + 6.198467
summarize B, mean
replace B=B-r(mean) + 6.198467
sum A B
// bootstrap the t statistic; the file path contains spaces, so it is quoted
bootstrap r(t), reps(10000) saving("C:\Documents and Settings\carlo\Documenti\Statistiche\Stata\Richard_boot.dta", every(1) replace) verbose : ttest A == B, unpaired unequal
save "C:\Documents and Settings\carlo\Documenti\Statistiche\Stata\Richard_preboot.dta", replace
use "C:\Documents and Settings\carlo\Documenti\Statistiche\Stata\Richard_boot.dta", clear
count if _bs_1>=5.6293 //= 0
count if _bs_1<=-5.6293 //= 0
//bootstrap p-value=(0+0)/10000=0 confirm the p-value calculated on the grounds of the
------------------------------end example-----------------------------------

About adding an arbitrary constant before log transforming the data, I would
refer you to a debate held on this list at the end of last March, raised by
a question on this topic. To sum up the results of the above-mentioned
debate, the answer was negative.

However, so-called shifted log transformations (that is, adding a constant
before taking logs so that zeros in the data can be retained) are reported
in the literature on cost comparisons of health care programmes (for a
thorough review and many useful comments on this issue, please see Barber JA,
Thompson SG. Analysis of cost data in randomized trials: an application of
the non-parametric bootstrap. Statist. Med. 2000; 19:3219-3236). As usual,
the main problem lies in the way back, that is, in back-transforming from
the log scale to the original metric: that is one reason why I prefer the
non-parametric bootstrap for analysing skewed cost data.

HTH and Kind Regards.

Carlo
-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Richard Harvey
Sent: Saturday, 27 September 2008 10:15
To: statalist@hsphsun2.harvard.edu
Subject: st: ttest and log transformation

Hi all,

I hope I can ask a fairly basic stats question. I have a variable that
I need to compare across two groups.
The summary stats for the variable NAN across the groups are as below.
The negative values are legitimate.

group   |     N       mean    p50        max         min   skewness   kurtosis

group1  |  2537     -77535   5278   19051350   -46844688     -11.23      311.1
group2  |  3031    -211373   4620    4609996   -32617714     -11.18      185.6
Total   |  5568    -150391   4958   19051350   -46844688     -11.33      278.4

If I do a ttest on the log transformed data, is it appropriate to add
an arbitrary constant to make the negative values positive?  Is the
ttest indeed any good for this data, or should I be looking at some
non-parametric tests?

To make the numbers more manageable I divide by 1,000,000 and the
summary stats look like this

group       N       mean       p50     max      min   skewness   kurtosis

group1   2537    -.07753   .005278   19.05   -46.84     -11.23      311.1
group2   3031     -.2114    .00462    4.61   -32.62     -11.18      185.6
Total    5568     -.1504   .004958   19.05   -46.84     -11.33      278.4

Is it right to perform ttest on ln((NAN/1000000)+50)? Changing the
constant I add doesn't seem to make a difference.

Stats on ln((NAN/100000)+50) are as below

group       N    mean     p50    max     min   skewness   kurtosis

group1   2537   4.604   4.605   4.78   3.973     -17.21      527.4
group2   3031   4.603   4.605   4.65    4.21      12.74      242.9
Total    5568   4.604   4.605   4.78   3.973     -15.94        469

There is still a large negative skewness coefficient.  To me this
does not look like a situation for a ttest, and I should be looking at
some non-parametric test. Is that right?

The results from the ttest using the unpaired and unequal option,
using the untransformed and using ln((NAN/100000)+50) are as below

transformation       t        p    95% CI
None              3.25    .0011    53205.45 - 214470.8
log(50+var)       2.75    .0060    .000367 - .002185 (I understand this has to be back transformed)

A ranksum test on the log-transformed NAN shows a z of 3.3999 with a p
of .0007. On the untransformed NAN it is 3.396 with a p of .0007.

So overall, there doesn't seem to be any change in the conclusions,
whatever test I use. But is the ttest procedure appropriate?

Your help is much appreciated.
--
rich
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

```
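The bootstrap steps in Carlo's reply can also be written so that the observed t statistic and the common mean are taken from stored results rather than typed in by hand. The following is only a minimal sketch under stated assumptions, not the code from the original message: the seed, the number of replications, the scalar names (`t_obs`, `mA`, `mB`, `m0`), the variable name `t_boot` and the use of a temporary file are all hypothetical choices, and the common mean is taken to be the simple average of the two sample means (which equals the pooled mean only because both variables have the same number of observations). It assumes a Stata version that provides `runiform()`, and it is meant to be run from a do-file.

```stata
clear
set seed 12345                      // hypothetical seed, only for reproducibility
set obs 100
generate A = 10*runiform()
generate B = 15*runiform()

* observed Welch t statistic on the original data
ttest A == B, unpaired unequal
scalar t_obs = r(t)

* impose the null hypothesis: shift both variables to a common mean
summarize A, meanonly
scalar mA = r(mean)
summarize B, meanonly
scalar mB = r(mean)
scalar m0 = (mA + mB)/2             // assumption: equal group sizes
replace A = A - mA + m0
replace B = B - mB + m0

* bootstrap the t statistic under the null; results go to a temporary file
tempfile boot
bootstrap t_boot=r(t), reps(1000) saving(`boot', replace) nodots: ///
    ttest A == B, unpaired unequal

* two-sided bootstrap p-value: share of replications at least as extreme
use `boot', clear
count if abs(t_boot) >= abs(scalar(t_obs))
display "bootstrap p-value = " r(N)/_N
```

As in the original example, `bootstrap` resamples whole observations (rows), so A and B are drawn together. With real two-group data held in long format, one would probably want to resample within each group instead, for instance through the `strata()` option.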