# st: re: Yeo-Johnson power transformation

From: Kit Baum <[email protected]>
To: [email protected]
Subject: st: re: Yeo-Johnson power transformation
Date: Tue, 23 Jan 2007 14:16:17 -0500

Rajiv said

Based on how the percentages are computed [100*(x-y)/(w+x+y)], the
Yeo-Johnson transformation does seem appropriate.

I follow most of what Kit suggested. Thanks again. However, based on
the Weisberg paper on the Yeo-Johnson transformation
(www.stat.umn.edu/arc/yjpower.pdf), I have a different interpretation
on four points.

1. I believe I should be using 2-`theta' instead of 2*`theta' at both
places toward the end of the code you suggested.
2. I believe Equation 2 on page 1 of the above PDF file is the one
being modeled. This includes two possibilities for y<0: one when
lambda <> 2 (I believe this is captured in the line two above else in
your suggested code), and the other when lambda = 2 (which I don't
think is captured).
3. There should be a negative sign prior to
( ( (abs($ML_y1)+1)^(2-`theta')-1)/(2-`theta' )
4. In the line after else, I believe there should be a +1 within
parentheses.
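
For reference, the Yeo-Johnson family as it is commonly stated is written out below. This is the standard textbook form, not a quotation from the Weisberg PDF, and the symbol names (psi, lambda, y) are chosen here for illustration. Under this form, the exponent for negative y is 2 - lambda, the negative branches carry a leading minus sign, and both log cases have a +1 inside the logarithm.

```latex
\psi(\lambda, y) =
\begin{cases}
\dfrac{(y+1)^{\lambda} - 1}{\lambda}, & y \ge 0,\ \lambda \ne 0 \\[1.5ex]
\ln(y+1), & y \ge 0,\ \lambda = 0 \\[1.5ex]
-\dfrac{(-y+1)^{2-\lambda} - 1}{2-\lambda}, & y < 0,\ \lambda \ne 2 \\[1.5ex]
-\ln(-y+1), & y < 0,\ \lambda = 2
\end{cases}
```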

Assuming I am right on the above points, should the last block of
code be as follows?

    qui gen double `yt' = .
    if `diffL'> 1e-10 {
        qui replace `yt' =( ( ($ML_y1+1)^`theta'-1)/ `theta' ) if $ML_y1 >= 0
        qui replace `yt' = -( ( (abs($ML_y1)+1)^(2-`theta')-1)/(2-`theta' )
        if ($ML_y1 < 0 and `diffL' <>2) <== B
        qui replace `yt' = -ln((abs($ML_y1)+1) if ($ML_y1 < 0 and `diffL' =2) <== A
    }
    else {
        qui replace `yt' = ln( $ML_y1+1 )
    }

I did not examine the Weisberg paper (I looked at the R reference you gave) so I do not know whether your interpretation of that paper is correct. But if it is, there are a couple of errors in your code from a syntactical standpoint. See lines <== A and <== B above.

A) The if clause contains `diffL'=2 where it should say `diffL'==2 to compare to 2. `diffL' is unlikely ever to evaluate to exactly 2, but you want to compare it, not set it.

B) This form of the if statement should not generally be used with variables. It is syntactically valid but hardly ever logically valid. What if ($ML_y1 < 0 ... means is: is the FIRST OBSERVATION on the variable negative? If you want to apply a conditional statement depending on whether each value of $ML_y1 is negative, you do it with an if qualifier on the command itself, not a "programmer's if". Such an if is valid when testing `diffL' against a small threshold because `diffL' is a scalar value.
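
The difference between the two kinds of if can be seen in a small sketch. The toy variable y and the names flag_cmd and flag_qual are made up for illustration and are not part of the program under discussion.

```stata
clear
input double y
 2
-1
 0.5
-3
end

* Programmer's if: the expression is evaluated once. A bare variable
* name here refers to its value in the first observation (y[1] = 2).
generate double flag_cmd = 0
if y < 0 {
    replace flag_cmd = 1      // never runs, because y[1] is not negative
}

* if qualifier: the expression is evaluated for every observation.
generate double flag_qual = 0
replace flag_qual = 1 if y < 0

list
```

With this data, flag_cmd stays 0 everywhere even though two observations are negative, while flag_qual marks exactly the negative ones. Note also that in Stata expressions "and" is written & and "not equal" is written != (or ~=).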

That said, I do not know whether a version of this program with those changes made will faithfully produce the Yeo-Johnson transformation, but it should be possible to try it out and compare the results to Yeo-Johnson estimates available elsewhere (e.g. from the R code).
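
One way to make that comparison is to apply the transformation outside the ml evaluator, at a fixed parameter value, and check a few numbers by hand or against another package. What follows is a minimal sketch, not the program from this thread. It assumes the commonly stated Yeo-Johnson form (exponent theta for y >= 0, exponent 2 - theta with a leading minus sign for y < 0, and log limits at theta = 0 and theta = 2). The data, the value theta = 0.5, the tolerance, and the variable name yt are all illustrative.

```stata
clear
input double y
-2
-0.5
 0
 1
 3
end

local theta 0.5        // illustrative value, not an estimate
local tol   1e-10      // tolerance for treating theta as 0 or as 2

generate double yt = .

* y >= 0: Box-Cox form applied to y + 1, with the log limit at theta = 0
if abs(`theta') > `tol' {
    replace yt = ((y + 1)^(`theta') - 1) / (`theta') if y >= 0
}
else {
    replace yt = ln(y + 1) if y >= 0
}

* y < 0: mirrored form with exponent 2 - theta, log limit at theta = 2
if abs(`theta' - 2) > `tol' {
    replace yt = -(((-y + 1)^(2 - `theta') - 1) / (2 - `theta')) if y < 0
}
else {
    replace yt = -ln(-y + 1) if y < 0
}

list y yt
```

The programmer's if is used only on the macro theta, which is a single value, and the sign of y is handled by if qualifiers on each replace. The two special cases are tested separately because theta near 0 affects only the nonnegative branch and theta near 2 affects only the negative branch.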

Kit Baum, Boston College Economics
http://ideas.repec.org/e/pba1.html
An Introduction to Modern Econometrics Using Stata:
http://www.stata-press.com/books/imeus.html
