# Re: st: logistic tranformation, proportion variables

From: Marck Bulter <[email protected]>
To: [email protected]
Subject: Re: st: logistic tranformation, proportion variables
Date: Fri, 14 Dec 2007 00:54:14 +0100

Austin Nichols wrote:

Marck--
No, replacing 0 with .001 is not appropriate, unless replacing it with
.0001 or .0000001 or 1e-30 etc. instead has no impact on the results,
in which case you could just drop the zeros and get the same results.
Also: Why is the sqrt(0) problematic?
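
For a sense of scale on the replacement values, it may help to write out the logit of a small constant. This is plain arithmetic; the symbol ε for the replacement value is introduced here and is not in the original post.

```latex
\[
\operatorname{logit}(\varepsilon) = \ln\varepsilon - \ln(1-\varepsilon) \approx \ln\varepsilon
\quad\text{for small } \varepsilon,
\]
\[
\operatorname{logit}(10^{-3}) \approx -6.9, \qquad
\operatorname{logit}(10^{-7}) \approx -16.1, \qquad
\operatorname{logit}(10^{-30}) \approx -69.1 .
\]
```

Each extra power of ten moves every replaced zero by about 2.3 units on the logit scale, so the position of the zeros is set by the chosen constant rather than by the data.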

My guess is that a better solution to your problem would be grounded
in theory. What is this regression supposed to measure the effects
of? If y is a proportion and x1 and x2 are proportions, and they
"want to be" transformed via logits, perhaps you should be using the
logs of the numerators and denominators of those variables, since
logit(a/(a+b))=ln(a)-ln(a+b)
so including the logit of a proportion X as an explanatory var is the
same as including the logs of its numerator and denominator and
constraining the coefficients N and D to satisfy N+D=0, which is a
testable restriction. Using the logit of a proportion Y as the
dependent var is the same as using the log of its numerator as the
depvar and the log of the denominator as a regressor and constraining
the coefficient on the log of the denominator to be 1, which is also a
testable restriction.
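
Written out, with a the numerator and a+b the denominator (so b is the complement), the two transformations that are easy to conflate here are as follows. The notation follows the formula above; the second line is added for comparison.

```latex
\[
\ln\frac{a}{a+b} = \ln a - \ln(a+b),
\]
\[
\operatorname{logit}\frac{a}{a+b}
  = \ln\frac{a/(a+b)}{b/(a+b)}
  = \ln a - \ln b .
\]
```

So ln(a)-ln(a+b) is the log of the proportion itself, while the logit in the strict sense pairs ln(a) with ln(b). In either case, entering the single transformed variable forces the two log terms to have coefficients that sum to zero, which is the kind of restriction described above.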

Of course, if the numerator is zero, the log is undefined and those
obs will drop out of the estimation. Theory can also help you here
sometimes--in particular, perhaps the sqrt(X) is actually what has a
linear effect on Y, not X, as Nick suggests.
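
A minimal Stata sketch of testing such a restriction, under stated assumptions: `y`, `a` and `b` are hypothetical variable names, with the proportion equal to a/(a+b) so that its logit is ln(a)-ln(b). The same pattern would apply with ln(a+b) in place of ln(b) if the regressor were the log of the proportion rather than its logit.

```stata
* Hypothetical variables: y (outcome), a (numerator of the proportion),
* b (complement, so that the proportion is a/(a+b)).
generate double ln_a = ln(a)    // ln(0) is missing, so zero numerators drop out
generate double ln_b = ln(b)

* Unrestricted regression on the two logs
regress y ln_a ln_b

* Entering logit(a/(a+b)) alone imposes equal and opposite coefficients;
* this Wald test checks that restriction: _b[ln_a] + _b[ln_b] = 0
test ln_a + ln_b = 0
```

Because `ln()` returns missing for zero, `regress` silently uses only the observations with a positive numerator and complement; comparing the reported number of observations with the full sample shows how many were lost.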

On Dec 13, 2007 11:58 AM, Marck Bulter <[email protected]> wrote:

Nick Cox wrote:

"Little" is not the adjective that springs to mind
for that help file.

More important, I don't think that help file answers
much of the question here.

As 0 and 1 are attainable, logit in the strict sense is
out of the question.

It seems to me that the main issue with a predictor that is
a proportion is what is the shape of the function relating

response | other predictors

to

proportional predictor | other predictors

and, setting aside the instrumental variable aspect here,
one handle on that might be given by added variable plots
after a plain multiple regression -- or graphical near
equivalents such as -mrunning- or -mlowess-. Use -findit-
to locate these user-written programs.
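
As a rough sketch of the added variable plot idea, assuming the data from the original post are in memory: the variable names below are taken from the -ivreg- command there, and the instrumenting is set aside as suggested.

```stata
* Plain multiple regression, ignoring the instrumental variable aspect
regress pstrmon precmon price maturity age coupon

* Added-variable plot for the proportion predictor: residuals of pstrmon on
* the other predictors against residuals of precmon on the other predictors
avplot precmon
```

Curvature in the plotted cloud would hint at the shape of transformation worth trying for the proportion predictor.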

My first stab at this would be to consider some power of
the predictor, say root or square. That way 0 and 1 stay
as they are but you can bend the scale in the middle.
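
A small worked example of why powers leave the endpoints alone; the value 0.25 is chosen here purely for illustration.

```latex
\[
f(x) = x^{p},\ p > 0: \qquad f(0) = 0, \quad f(1) = 1,
\]
\[
\sqrt{0.25} = 0.5, \qquad 0.25^{2} = 0.0625 .
\]
```

A root stretches the lower part of the scale and a square compresses it, with no observations lost at 0 or 1.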

Nick
[email protected]

David Airey

Nick Cox has a little Stata help file on transformations.

ssc install transint

Marck Bulter

I have a question that is not entirely related to Stata. Do hope
that you forgive me.

Assume the following model,

ivreg pstrmon price maturity age coupon pstrmonprev pstrprev
intrest ivol compl (precmon = precmonprev)

Where pstrmon, pstrmonprev, precmon and precmonprev are all
proportions. In this case, value bond A / total value bonds, etc.
Therefore, they can take any value between 0 and 1, 0 and 1 included.
These last 4 variables are heavily left skewed. Post estimation,
resid is heteroskedastic, and resid is not normally distributed.
On the Statalist server I have found several references to logistic
transformations, ln(y/(1-y)):
- http://www.stata.com/statalist/archive/2003-07/msg00285.html
- home.fsw.vu.nl/m.buis/presentations/UKsug06.pdf
- http://www.stata.com/statalist/archive/2006-02/msg00150.html

If I transform the 4 variables using the logistic transformation, the 4
variables are no longer skewed, resid is almost homoskedastic, and
resid is almost normally distributed.
But my question is, is this transformation allowed, as I have mostly
seen references only to transformation of the dependent variable.
In addition, the transformation makes the interpretation of the
coefficients hard. Any comment on this?
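
A minimal sketch of the transformation in Stata, assuming the data from this post are in memory. The new variable names `l_pstrmon` and `l_precmon` are made up for illustration.

```stata
* logit() is Stata's built-in function ln(x/(1-x)); it returns missing
* at exactly 0 and 1, so those observations drop out of any estimation.
generate double l_pstrmon = logit(pstrmon)
generate double l_precmon = logit(precmon)

* How many observations are lost to the transformation?
count if missing(l_pstrmon) & !missing(pstrmon)
count if missing(l_precmon) & !missing(precmon)
```

On interpretation: one unit on the logit scale multiplies the odds p/(1-p) by e, about 2.72, so a coefficient on a logit-transformed predictor is the change in the response per unit change in log odds.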

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/

Dear Nick,

I have read the transint files; these are very informative. Thank you for
sharing. And thanks to David Airey for pointing me to transint. But
indeed, these do not answer my question entirely.

Strictly, 100% is possible, but the proportion data I have range from 0
to 0.8. The author of the following published article,

http://www.cepr.org/pubs/new-dps/dplist.asp?dpno=5153

converts 0 values to 0.001 and 1 to 0.999. Not the prettiest
solution, but strictly the logistic transformation is no longer out of the question.
My master's thesis is an extension of previous research, where the
author also used proportions as dependent and independent variables, but he
did not explain whether, and if so how, he transformed the variables.

On your suggestion of root and square: sqrt does improve things a bit,
but of course the 0 values are problematic; in addition, the resid
assumptions are problematic.
Do you think that the conversion to 0.001 is appropriate? And more
important, is it appropriate to use logistic transformed variables both
as dependent and independent variables?

Sorry for not being entirely accurate the first time.

Regards,
Marck Bulter
Currently, -mlowess- is running; it is a bit computationally intensive.

Dear Austin,

I have tried, against your advice, to convert the 0's to 0.001, 0.0001, etc. But the resid scatter plot shows a clear straight bar through the cloud. So a number of residuals are lining up, and the effect becomes more pronounced (the bar moves out of the cloud) if I decrease the conversion value. As a result I will skip the conversion. I have no idea how the author of

http://www.cepr.org/pubs/new-dps/dplist.asp?dpno=5153

managed to publish the paper, since I would expect that (s)he would see a similar result (no resid scatter printed?).
But anyway, to answer your question regarding sqrt: the resid cloud is oriented to the left, and has a clear cutoff line running diagonally from zero to the x axis. (I wish I could attach the plot, but I don't want to jam others' mailboxes.) So I am inclined to look for other options.
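
The straight bar described above has a simple algebraic source. This assumes the plot shows residuals against fitted values and that the replaced zeros sit in the dependent variable. The symbols ε (replacement value), c, and e (residual) are introduced here.

```latex
\[
y_i = \operatorname{logit}(\varepsilon) = c
\quad\Longrightarrow\quad
e_i = y_i - \hat{y}_i = c - \hat{y}_i .
\]
```

Every replaced observation therefore lies on one line of slope -1 in the plot, and since c becomes more negative as ε shrinks, that line moves further from the rest of the cloud.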

Regarding logit(a/(a+b))=ln(a)-ln(a+b): I will test what the logs of the numerator and denominator (log n and log d) give. Thank you for pointing this out; it is an interesting suggestion.
To explain a bit about what I am trying to measure:

ivreg pstrmon price maturity age coupon l.pstrmonprev l.pstr interest ivolatily compl (precmon = l.precmon)

and

ivreg precmon price maturity age coupon l.precmon l.pstr intr ivol compl (pstrmon = l.pstrmon)

Here, pstrmon is the proportion (\$ value) of U.S. Treasury bond i, out of the total value outstanding of bond type i, that is converted to zero coupon bonds per month; thus it is a measure of activity. precmon is similar, but in the opposite direction: the value that is converted back to U.S. Treasury bonds of type i, as a proportion of the total value of bond type i. pstr is the proportion (\$ value) of bond i that is held in converted form. So this is a dynamic process, going from normal bond to zero coupon bond, and back. In fixed income terms I would refer to this process as stripping and reconstitution activity. The other variables are bond properties, like price, coupon rate, maturity, age, etc. To go back to the 0 values: in some months there is no activity. To make things even more interesting, this is a panel data study, with i=103 and a time period from 1997 to 2006.
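
Since the commands above use the `l.` lag operator on panel data, a minimal setup sketch may be useful. `bondid` and `month` are hypothetical names for the bond and month identifiers, which the post does not give.

```stata
* bondid and month are hypothetical names for the panel and time identifiers
xtset bondid month

* Once the panel is declared, l.precmon is the previous month's value
* for the same bond (missing for a bond's first month)
generate double precmon_lag = l.precmon
```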

regards,

Marck Bulter
