Based on how the percentages are computed [100*(x-y)/(w+x+y)], the
Yeo-Johnson transformation does seem appropriate.
I follow most of what Kit suggested. Thanks again. However, based on
the Weisberg paper on the Yeo-Johnson transformation
(www.stat.umn.edu/arc/yjpower.pdf), I have a different interpretation
of four aspects.
1. I believe I should be using 2-`theta' instead of 2*`theta' at both
places toward the end of the code you suggested.
2. I believe Equation 2 on page 1 of the above PDF file is the one
being modeled. This includes two possibilities for y<0, one when
lambda <> 2 (I believe this is captured in the line two above else in
your suggested code), and the other when lambda = 2 (which I don't
think is captured).
3. There should be a negative sign before
( ( (abs($ML_y1)+1)^(2-`theta')-1)/(2-`theta' )
4. In the line after else, I believe there should be a +1 within
parentheses.
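For reference, the four points above can be checked against the Yeo-Johnson family as it is usually stated. This is written from the standard definition rather than copied from the PDF, so the equation numbering there should be verified separately. It assumes y stands for $ML_y1 and \lambda for `theta'.

```latex
\psi(\lambda, y) =
\begin{cases}
\dfrac{(y+1)^{\lambda} - 1}{\lambda} & y \ge 0,\ \lambda \ne 0 \\[1ex]
\log(y+1) & y \ge 0,\ \lambda = 0 \\[1ex]
-\dfrac{(-y+1)^{2-\lambda} - 1}{2-\lambda} & y < 0,\ \lambda \ne 2 \\[1ex]
-\log(-y+1) & y < 0,\ \lambda = 2
\end{cases}
```

For y < 0, -y equals abs(y), which is why abs($ML_y1) appears in the code. The exponent and divisor in the third branch are 2 minus lambda, not 2 times lambda.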
Assuming I am right on the above points, should the last block of
code be as follows?
qui gen double `yt' = .
if `diffL'> 1e-10 {
qui replace `yt' =( ( ($ML_y1+1)^`theta'-1)/ `theta' ) if $ML_y1 >= 0
qui replace `yt' = -( ( (abs($ML_y1)+1)^(2-`theta')-1)/(2-`theta' )
if ($ML_y1 < 0 and `diffL' <>2)    <== B
qui replace `yt' = -ln((abs($ML_y1)+1) if ($ML_y1 < 0 and `diffL' =2)    <== A
}
else {
qui replace `yt' = ln( $ML_y1+1 )
}
I did not examine the Weisberg paper (I looked at the R reference you
gave) so I do not know whether your interpretation of that paper is
correct. But if it is, there are a couple of errors in your code from a
syntactical standpoint. See lines <== A and <== B above.
A) The if clause references `diffL'=2 where it should say `diffL'==2 to
compare to 2. `diffL' is unlikely to ever evaluate to exactly 2, but
you want to compare it, not set it.
B) This form of the if statement should not generally be used with
variables. It is syntactically valid but hardly ever logically valid.
What if ($ML_y1 < 0 ... asks is whether the FIRST OBSERVATION on the
variable is negative. If you want to apply a conditional statement
depending on whether each value of $ML_y1 is negative, you do it with
an if qualifier on the command, not a "programmer's if". Such an if is
valid when testing `diffL' against a small threshold because `diffL'
is a scalar value.
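A minimal illustration of the difference, using the auto dataset shipped with Stata rather than anything from the evaluator above. The names price, flag, and lambda are only for illustration.

```stata
sysuse auto, clear

* if command ("programmer's if"): the expression is evaluated once.
* A variable name here is read as its first observation, i.e. price[1].
if price > 5000 {
    display "price[1] exceeds 5000"
}
else {
    display "price[1] does not exceed 5000"
}

* if qualifier: the expression is evaluated separately for each observation.
generate byte flag = 0
replace flag = 1 if price > 5000

* An if command is the right tool when the thing tested is a single value,
* such as a local macro or scalar.
local lambda 0.5
if abs(`lambda') > 1e-10 {
    display "lambda treated as nonzero"
}
```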
That said, I do not know whether a version of this program with those
changes made will faithfully produce the Yeo-Johnson transformation,
but it should be possible to try it out and compare the results to
Yeo-Johnson estimates available elsewhere (e.g. from the R code).
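One way to try it out is a small standalone program that applies the four branches for a fixed lambda, which can then be checked against values from another package. What follows is only a sketch under assumptions: yjtrans, its options, and the tolerance are hypothetical names and choices, not part of the original evaluator. Inside an ml evaluator `theta' may be a variable rather than a single number, so the scalar tests would need adapting there. Note also that Stata writes logical "and" as & and "not equal" as != (or ~=), so the "and" and "<>" in the block above would need changing as well.

```stata
* Hypothetical helper: Yeo-Johnson transform of one variable for a fixed lambda.
capture program drop yjtrans
program define yjtrans
    syntax varname(numeric), GENerate(name) [LAMbda(real 1)]
    local y `varlist'
    local tol 1e-10    // assumed tolerance for "lambda equals 0 or 2"

    quietly {
        generate double `generate' = .

        * Branches for y >= 0: the test on lambda is a scalar test (if command),
        * the test on y is per observation (if qualifier).
        if abs(`lambda') > `tol' {
            replace `generate' = ((`y' + 1)^(`lambda') - 1) / (`lambda') ///
                if `y' >= 0 & !missing(`y')
        }
        else {
            replace `generate' = ln(`y' + 1) if `y' >= 0 & !missing(`y')
        }

        * Branches for y < 0: note 2 - lambda and the leading minus sign.
        if abs(`lambda' - 2) > `tol' {
            replace `generate' = -((abs(`y') + 1)^(2 - (`lambda')) - 1) ///
                / (2 - (`lambda')) if `y' < 0
        }
        else {
            replace `generate' = -ln(abs(`y') + 1) if `y' < 0
        }
    }
end

* Usage on a made-up variable that takes both signs.
sysuse auto, clear
generate double d = price - 6000
yjtrans d, generate(d_yj) lambda(0.5)
summarize d d_yj
```

With lambda(1) the result should reproduce the input variable, which gives a quick sanity check before comparing against the R output.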