# Re: st: log-transformation of an independent variable in logistic regression: What to do with the zeroes

From: n j cox
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: log-transformation of an independent variable in logistic regression: What to do with the zeroes
Date: Thu, 01 Nov 2007 16:59:08 +0000

The original discussion was in a thread started by Rosy Reynolds,

<http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0710/Subject/article-915.html>

Daniel changed the subject to this other topic -- also very interesting,
but as far as I can see only connected by a common problem of zeros and
a log scale. Thanks for the detailed follow-up.

I had difficulty with Daniel's contribution because I couldn't see that
it related to Rosy's model, which was quite different in form, and
I still don't.

As said earlier, I appreciate that the mapping cond(x == 0, 0, log(x)) is
not as problematic as it looks given the use of a dummy for 0, but
that "given" still seems vital to me.

Nick
n.j.cox@durham.ac.uk

Daniel Waxman
--------------------------------------------------------------------------------

This is a follow up to a side-discussion that came up in another
thread last week. My suggested method for using a dummy variable to
represent zero values of the untransformed independent variable was
met with some skepticism, and I had promised an example/demonstration…

The rationale for using a dummy is as follows:

First: A value of zero is often qualitatively different from any
positive value. In the example that follows, I describe the
relationship between troponin I and mortality. Troponin, like any
laboratory test, has a detection limit below which any concentration
will be reported as zero. In my example, that detection limit is .01
mcg/L. So a value reported by the laboratory analyzer as zero might
represent a true concentration of .001 mcg/L, .0001 mcg/L,
or "truly zero." Most likely it represents an admixture of all of
these. The choice of a single value to represent zero would be
arbitrary, and different choices can give markedly different results
if the regression is fit with only one continuous variable. Another
example, given by Hosmer & Lemeshow, is the effect of number of
cigarettes per day on a particular health outcome. There is good
reason to believe that the condition of nonsmoking does not fit
anywhere on a continuous scale that is a function of the number of
cigarettes smoked per day.

Second: The fact that log(0) does not exist is not simply a
mathematical nuisance. If an outcome relates to the logarithm of
another variable, the meaning is that the outcome varies with
order-of-magnitude changes. For example, for troponin I, the odds of
mortality approximately double with each 10-fold increase in positive
values. What can this mean with regard to a troponin of zero? Well,
if it were truly zero, then an order of magnitude change is
meaningless, and the case of zero would need to be dealt with
separately.
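
To make the order-of-magnitude reading concrete, here is a small Python sketch (not the Stata code referred to in this post). It assumes, as stated above, an odds ratio of 2 per 10-fold increase; the names `beta_log10` and `odds_ratio` are hypothetical and chosen only for illustration.

```python
import math

# Assumption taken from the text: odds of death double per 10-fold increase
# in a positive troponin value.
odds_ratio_per_decade = 2.0

# Implied coefficient on log10(troponin) in a logit model.
beta_log10 = math.log(odds_ratio_per_decade)

def odds_ratio(t_from, t_to):
    """Odds ratio implied by moving between two positive troponin values."""
    decades = math.log10(t_to) - math.log10(t_from)
    return math.exp(beta_log10 * decades)

print(round(beta_log10, 3))             # 0.693
print(round(odds_ratio(0.05, 5.0), 3))  # 100-fold increase: 4.0

# Zero is not a finite number of 10-fold steps away from any positive value,
# so it has no position on this scale.
try:
    math.log10(0.0)
except ValueError as err:
    print("log10(0) fails:", err)
```

Any two positive values can be compared this way, but a reported zero cannot be placed on the log scale at all, which is why it needs a term of its own.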

The solution is to use two variables: one to represent log10(troponin)
for positive values and a dummy variable to represent the case of
troponin==0.

It is important to formulate the dummy variable as zero for cases of
troponin==0 and 1 otherwise, and not the other way around.

When I mentioned this before, Nick seemed to trip over the replacement
of log(troponin) with zero if troponin==0, since log(1)=0 as well.
The dummy variable takes care of that: it separates troponin==0 from
troponin==1 even though both have a log term of zero.
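
The two-variable encoding described above can be sketched as follows. This is a Python illustration rather than Stata, and the function name `encode` and the variable names `positive` and `log10_trop` are hypothetical.

```python
import math

def encode(troponin):
    """Return (positive, log10_trop) for one reported troponin value.

    positive   : 0 if the reported value is zero, 1 otherwise
    log10_trop : log10(troponin) for positive values; 0 as a placeholder
                 when the reported value is zero
    """
    if troponin == 0:
        return 0, 0.0
    return 1, math.log10(troponin)

for t in (0, 0.01, 1, 100):
    print(t, encode(t))
```

A reported zero and a value of 1 both get a log term of 0, but they differ in the dummy, so the model can assign them different risks. Without the dummy the two cases would be indistinguishable.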

In the following example (which can be cut-and-pasted into Stata), I
first create a simulated data set with the following characteristics:

- 20% of troponin values are zero.
- The remainder of observations are uniformly distributed between .01 and 100.
- Mortality rate for troponin==0 is 2%.
- Mortality rate for troponin==.01 is 3% and doubles for every 10-fold increase.
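
The mortality rates implied by these characteristics can be tabulated with a short Python sketch. It reads "doubles" as the mortality rate itself doubling per 10-fold increase, which is an assumption about the intended meaning; the function name `true_mortality` and the constants are hypothetical.

```python
import math

P_ZERO = 0.02       # mortality when troponin is reported as zero
P_AT_LIMIT = 0.03   # mortality at the detection limit
LIMIT = 0.01        # detection limit in mcg/L

def true_mortality(troponin):
    """Mortality rate under the stated characteristics (rate doubles per decade)."""
    if troponin == 0:
        return P_ZERO
    decades_above_limit = math.log10(troponin / LIMIT)
    return P_AT_LIMIT * 2 ** decades_above_limit

for t in (0, 0.01, 0.1, 1, 10, 100):
    print(f"{t:>6}: {true_mortality(t):.3f}")
```

Under this reading the rates at .01, .1, 1, 10 and 100 are 3%, 6%, 12%, 24% and 48%, so the top of the range stays below 100%.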

I then perform a logistic regression of the simulated data using the
dummy variable formulation and compare predicted probabilities to the
parameters used to create the data.
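
The simulate-then-fit procedure can be sketched in Python as below. This is not the Stata code referred to in this post; it is a minimal stand-in under labeled assumptions. In particular, "uniformly distributed" is read here as uniform on the log10 scale (the raw-scale reading would draw the value itself uniformly and then take log10), the mortality rate rather than the odds is doubled per decade, and all function names (`simulate`, `solve3`, `fit_logit`, `predict`) are hypothetical.

```python
import math
import random

def simulate(n, seed=12345):
    """Draw (positive, log10_trop, died) triples under the stated characteristics."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        if rng.random() < 0.20:              # 20% reported as zero
            positive, lg, p = 0, 0.0, 0.02
        else:
            lg = rng.uniform(-2.0, 2.0)      # ASSUMPTION: uniform on the log10 scale
            positive, p = 1, 0.03 * 2 ** (lg + 2.0)
        rows.append((positive, lg, 1 if rng.random() < p else 0))
    return rows

def solve3(a, b):
    """Solve the 3x3 system a x = b by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(3):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][3] / m[i][i] for i in range(3)]

def fit_logit(rows, iterations=20):
    """Newton-Raphson for logit(p) = b0 + b1*positive + b2*log10_trop."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]
        for positive, lg, died in rows:
            x = (1.0, float(positive), lg)
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(3):
                grad[i] += (died - p) * x[i]
                for j in range(3):
                    hess[i][j] += w * x[i] * x[j]
        step = solve3(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def predict(beta, positive, lg):
    """Predicted probability of death for one encoded observation."""
    eta = beta[0] + beta[1] * positive + beta[2] * lg
    return 1.0 / (1.0 + math.exp(-eta))

rows = simulate(10000)
beta = fit_logit(rows)
print("troponin == 0 :", round(predict(beta, 0, 0.0), 4), "(target 0.02)")
for lg in (-2, -1, 0, 1, 2):
    target = 0.03 * 2 ** (lg + 2)
    print(f"log10 = {lg:+d}    :", round(predict(beta, 1, lg), 4), f"(target {target:.2f})")
```

With this coding the constant alone describes the troponin==0 group, so its predicted probability equals the observed death rate in that group. The positive values are only approximated, because a straight line on the logit scale is not exactly the same curve as a rate that doubles per decade.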

Apologies in advance if I am missing Nick's previous point, or if this
seems absurdly obvious.
If anybody knows of a more "standard" way to do this, please let me know.
