# Re: st: logistic tranformation, proportion variables

 From Marck Bulter <[email protected]> To [email protected] Subject Re: st: logistic tranformation, proportion variables Date Fri, 14 Dec 2007 01:23:30 +0100

Nick Cox wrote:
I agree with Austin here. The fudge mapping zero -> smidgen before logit cannot be harmless as logit(smidgen) will become very large negative as smidgen becomes very small. I can't see an easy trade-off there. Otherwise put, the smaller smidgen is, the more you create outliers in your predictor space. Better not to transform or to use a transform that is not problematic at 0. And square roots are not.
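
To put rough numbers on that point, here is a minimal sketch (in Python rather than Stata, purely for illustration). The smidgen values are made up, and `logit` is a helper defined here, not a library call.

```python
import math

def logit(p):
    """Log odds ln(p / (1 - p)); undefined at p = 0 and p = 1."""
    return math.log(p / (1.0 - p))

# Hypothetical "smidgen" values substituted for an exact zero.
smidgens = [1e-3, 1e-6, 1e-30]
logits = [logit(s) for s in smidgens]

for s, value in zip(smidgens, logits):
    print(f"smidgen={s:g}  logit={value:.2f}  sqrt={math.sqrt(s):.2e}")

# The square root needs no substitution at all: it is defined at zero.
sqrt_at_zero = math.sqrt(0.0)
print(f"sqrt(0) = {sqrt_at_zero}")
```

The transformed value keeps moving away from the rest of the data as the smidgen shrinks, so the choice of smidgen is not innocuous.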
In addition, exact zeros sometimes convey qualitative information. If a predictor is the proportion of income spent on tobacco, then the people with zeros presumably don't smoke, and pretending that they do (even a little) is a distortion of the data.
Without doubt, people do this kind of fudge, and sometimes the argument is that they can't think of a better way, but I won't sign up to approve.
Nick

P.S... -mlowess- is intensive because -lowess- is.
Austin Nichols wrote:

Marck--
No, replacing 0 with .001 is not appropriate, unless replacing it with
.0001 or .0000001 or 1e-30 etc. instead has no impact on the results,
in which case you could just drop the zeros and get the same results.
Also: Why is the sqrt(0) problematic?

My guess is that a better solution to your problem would be grounded
in theory. What is this regression supposed to measure the effects
of? If y is a proportion and x1 and x2 are proportions, and they
"want to be" transformed via logits, perhaps you should be using the
logs of the two counts behind each proportion, since
logit(a/(a+b))=ln(a)-ln(b)
so including the logit of a proportion X as an explanatory var is the
same as including the logs of its numerator a and of the complement b
(denominator minus numerator) and constraining their coefficients N
and D to satisfy N+D=0, which is a testable restriction. Using the
logit of a proportion Y as the depvar is the same as using the log of
its numerator as the depvar and the log of the complement as a
regressor and constraining the coefficient on the log of the
complement to be 1, which is also a testable restriction.
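
A quick numerical check, in Python rather than Stata, of how a logit decomposes into logs. The counts `a` and `b` are made up: `a` is the numerator and `b` the complement, so the proportion is a/(a+b). The check suggests that logit(a/(a+b)) matches ln(a) - ln(b), whereas ln(a) - ln(a+b) is the plain log of the proportion.

```python
import math

def logit(p):
    """Log odds ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

a, b = 3.0, 7.0          # hypothetical counts: numerator a, complement b
p = a / (a + b)          # the proportion

lhs = logit(p)
log_odds = math.log(a) - math.log(b)        # ln(a) - ln(b)
log_prop = math.log(a) - math.log(a + b)    # ln(a) - ln(a+b)

print(math.isclose(lhs, log_odds))          # logit(p) vs ln(a) - ln(b)
print(math.isclose(math.log(p), log_prop))  # ln(p) vs ln(a) - ln(a+b)
```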

Of course, if the numerator is zero, the log is undefined and those
obs will drop out of the estimation. Theory can also help you here
sometimes--in particular, perhaps the sqrt(X) is actually what has a
linear effect on Y, not X, as Nick suggests.

On Dec 13, 2007 11:58 AM, Marck Bulter <[email protected]> wrote:

Nick Cox wrote:

"Little" is not the adjective that springs to mind
for that help file.

More important, I don't think that help file answers
much of the question here.

As 0 and 1 are attainable, logit in the strict sense is
out of the question.

It seems to me that the main issue with a predictor that is
a proportion is what is the shape of the function relating

response | other predictors

to

proportional predictor | other predictors

and, setting aside the instrumental variable aspect here,
one handle on that might be given by added variable plots
after a plain multiple regression -- or graphical near
equivalents such as -mrunning- or -mlowess-. Use -findit-
to locate these user-written programs.

My first stab at this would be to consider some power of
the predictor, say root or square. That way 0 and 1 stay
as they are but you can bend the scale in the middle.
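
A minimal sketch of that idea, in Python rather than Stata and with made-up proportions: root and square both leave 0 and 1 where they are and only bend the scale in between.

```python
import math

props = [0.0, 0.25, 0.5, 0.8, 1.0]   # made-up proportions

roots = [math.sqrt(p) for p in props]   # power 1/2
squares = [p ** 2 for p in props]       # power 2

for p, r, s in zip(props, roots, squares):
    print(f"p={p:.2f}  root={r:.3f}  square={s:.3f}")
```

The root stretches the lower end of the scale and the square stretches the upper end, which is the sense in which a power lets you bend the middle without touching the endpoints.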

Nick
[email protected]
Dear Nick,

I have read the transit files; they are very informative. Thank you
for sharing. And thanks to David Airey for pointing me to transit. But
indeed, these do not answer my question entirely.

Strictly, 100% is possible, but the proportion data I have range from
0 to 0.8. The author of the following published article,

http://www.cepr.org/pubs/new-dps/dplist.asp?dpno=5153

converts 0 values to 0.001 and 1 to 0.999. Not the prettiest
solution, but strictly the logistic transformation is no longer out of
the question.

My master's thesis is an extension of previous research, where the
author also used proportions as dependent and independent variables,
but he did not explain whether, and if so how, he transformed the
variables.

As for your suggestion of root and square: sqrt does improve things a
bit, but of course the 0 values are problematic; in addition, the
residual assumptions are problematic.
Do you think that the conversion to 0.001 is appropriate? And, more
importantly, is it appropriate to use logistic-transformed variables
both as dependent and independent variables?

Sorry for not being entirely accurate the first time.

Regards,
Marck Bulter
Currently, mlowess is running; it is a bit computer-intensive.
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
Nick,

I totally agree, conversion is an awful solution: it fits the data to the model. But I still have to do something about the heteroskedasticity and the non-normal residuals.
As suggested in your transit files, I will give a folded transformation a try.
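
For concreteness, a minimal sketch of one folded transformation, in Python rather than Stata. It assumes the folded root sqrt(p) - sqrt(1 - p); which folded transformation would actually be used is not stated in the thread, and the proportions are made up.

```python
import math

def folded_root(p):
    """Folded root sqrt(p) - sqrt(1 - p), an assumed choice of folded transformation."""
    return math.sqrt(p) - math.sqrt(1.0 - p)

props = [0.0, 0.2, 0.5, 0.8, 1.0]   # made-up proportions, including exact 0 and 1
folded = [folded_root(p) for p in props]

for p, f in zip(props, folded):
    print(f"p={p:.1f}  folded root={f:.3f}")
```

Unlike the logit, this is defined at exact 0 and 1, so no zeros need to be replaced, and it treats p and 1 - p symmetrically around 0.5.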