# Re: st: R: ttest and log transformation

 From "Richard Harvey" <[email protected]> To [email protected] Subject Re: st: R: ttest and log transformation Date Sun, 28 Sep 2008 12:32:33 +0100

```Hi ,

sample size. In the summary stats I posted, the N is large because it is for
the whole sample, but when I analyse subsamples some of them are very
small, i.e. fewer than 20 observations.

The bootstrap seems like a good idea.  Can I do something  as simple as

bootstrap r(t) reps(1000) saving(c:\), ttest var, by(catvar)  unpaired unequal

or is it  something more involved as below?

bootstrap r(mean) if catvar=="cat1", reps(1000):sum var
matrix mu_1=e(b)
matrix sterrsq_1=e(V)
bootstrap r(mean) if catvar=="cat2", reps(1000):sum var
matrix mu_2=e(b)
matrix sterrsq_2=e(V)
scalar Z=((mu_1[1,1]- mu_2[1,1])/sqrt(sterrsq_1[1,1]+ sterrsq_2[1,1]))
scalar p=(1-normal(abs(Z)))*2
di "z-value: " Z
di "p = " p
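
* A minimal sketch of the one-command form, assuming the names var and
* catvar used above and a catvar with exactly two levels. The name t and
* the seed value are arbitrary choices made for illustration.
* strata() resamples within each group, so every replicate keeps the
* original group sizes.
bootstrap t=r(t), reps(1000) seed(12345) strata(catvar): ///
    ttest var, by(catvar) unequal
estat bootstrap, percentile   // percentile confidence interval for t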

thanks very much for your help

regards
rich

2008/9/27 Carlo Lazzaro <[email protected]>:
>
> Dear Rich,
> steps:
>
> take a look at the resulting sampling distribution; perform a bootstrap
> ttest; calculate how many times t_bootstrap is >= |t_original| or <=
> -|t_original|; contrast the obtained bootstrap p-value with the original one
> ---------------------------begin example-----------------------------------
> set obs 100
> g A=10*(uniform())
> g B=15*(uniform())
> swilk A B // Prob>z_A=0.00030; Prob>z_B=0.00032 // Both A and B are not normal
> ttest A == B, unpaired unequal // t = -5.6293 and Pr(|T| > |t|) = 0.0000
> return list
> scalar t=r(t)
> summarize A, mean
> replace A=A-r(mean) + 6.198467
> summarize B, mean
> replace B=B-r(mean) + 6.198467
> sum A B
> // the path contains spaces, so the file name is quoted
> bootstrap r(t), reps(10000) ///
>     saving("C:\Documents and Settings\carlo\Documenti\Statistiche\Stata\Richard_boot.dta", every(1) replace) ///
>     verbose : ttest A == B, unpaired unequal
> save "C:\Documents and Settings\carlo\Documenti\Statistiche\Stata\Richard_preboot.dta", replace
> use "C:\Documents and Settings\carlo\Documenti\Statistiche\Stata\Richard_boot.dta", clear
> count if _bs_1>=5.6293 //= 0
> count if _bs_1<=-5.6293 //= 0
> // bootstrap p-value=(0+0)/10000=0 confirm the p-value calculated on the grounds of the
> ------------------------------end example-----------------------------------
>
>
> About adding an arbitrary constant to log-transformed data, I would refer
> you to a debate held on this list at the end of last March, raised by a
> question on this topic. To sum up the results of the above-mentioned
> debate, the answer was negative.
>
> However, so-called shifted log transformations (that is, adding a constant
> before taking logs so that zeros in the data can be retained) are reported
> in the literature on cost comparison of health care programmes (for a
> thorough review and many useful comments on this issue, please see
> Barber JA, Thompson SG. Analysis of cost data in randomized
> trials: an application of the non-parametric bootstrap. Statist. Med. 2000;
> 19:3219-3236). As usual, the main problem is on the way back (that is, in
> back-transforming from logs to the original metric: that's one reason why I
> prefer the non-parametric bootstrap for analysing skewed cost data).
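>
> * A sketch of that idea (not part of the original example), assuming the
> * variable NAN and a two-level grouping variable, here called group. The
> * difference in means is bootstrapped on the original scale, so no
> * back-transformation is needed. The name diff and the seed value are
> * arbitrary; r(mu_1) and r(mu_2) are the group means stored by ttest.
> bootstrap diff=(r(mu_1)-r(mu_2)), reps(1000) seed(12345) strata(group): ///
>     ttest NAN, by(group) unequal
> estat bootstrap, percentile   // percentile CI for the difference in means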
>
> HTH and Kind Regards.
>
>
> Carlo
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On behalf of Richard Harvey
> Sent: Saturday, 27 September 2008 10:15
> To: [email protected]
> Subject: st: ttest and log transformation
>
> Hi all,
>
> I hope I can ask a fairly basic stats question. I have a variable that
> I need to compare across two groups.
> The summary stats for the variable NAN across the groups are as below.
> The negative values are legitimate.
>
> group   |     N       mean     p50        max          min   skewness  kurtosis
>
> group1  |  2537     -77535    5278   19051350    -46844688     -11.23     311.1
> group2  |  3031    -211373    4620    4609996    -32617714     -11.18     185.6
> Total   |  5568    -150391    4958   19051350    -46844688     -11.33     278.4
>
> If I do a ttest on the log-transformed data, is it appropriate to add
> an arbitrary constant to make the negative values positive?  Is the
> ttest indeed any good for this data, or should I be looking at some
> non-parametric tests?
>
> To make the numbers more manageable, I divide by 1,000,000 and the
> summary stats look like this:
>
> group      N      mean       p50      max      min   skewness  kurtosis
>
> group1  2537   -.07753   .005278    19.05   -46.84     -11.23     311.1
> group2  3031    -.2114    .00462     4.61   -32.62     -11.18     185.6
> Total   5568    -.1504   .004958    19.05   -46.84     -11.33     278.4
>
> Is it right to perform ttest on ln((NAN/1000000)+50)? Changing the
> constant I add doesn't seem to make a difference.
>
> Stats on ln((NAN/1000000)+50) are as below:
>
> group      N     mean     p50     max     min   skewness  kurtosis
>
> group1  2537    4.604   4.605    4.78   3.973     -17.21     527.4
> group2  3031    4.603   4.605    4.65    4.21      12.74     242.9
> Total   5568    4.604   4.605    4.78   3.973     -15.94       469
>
> There is still a large negative skewness coefficient.  To me this
> does not look like a situation for a ttest, and I should be looking at
> some non-parametric test. Is that right?
>
> The results from the ttest using the unpaired and unequal options,
> on the untransformed data and on ln((NAN/1000000)+50), are as below:
>
> transformation      t        p     95% CI
> None             3.25    .0011     53205.45 - 214470.8
> log(50+var)      2.75    .0060     .000367 - .002185 (I understand this has to be back-transformed)
>
> A ranksum test on the log-transformed NAN shows a z of 3.3999 with a p
> of .0007; on the untransformed NAN it is 3.396 with a p of .0007.
>
> So overall, there doesn't seem to be any change in the conclusions,
> whatever test I use. But is the ttest procedure appropriate?
>
> Your help is much appreciated.
> --
> rich
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
>
>

--