st: log-transformation of an independent variable in logistic regression: What to do with the zeroes

From: "Daniel Waxman"
To: statalist@hsphsun2.harvard.edu
Subject: st: log-transformation of an independent variable in logistic regression: What to do with the zeroes
Date: Thu, 1 Nov 2007 12:11:00 -0400

```
Statalist,

This is a follow up to a side-discussion that came up in another
thread last week.   My suggested method for using a dummy variable to
represent zero values of the untransformed independent variable was
met with some skepticism, and I had promised an example/demonstration…

The rationale for using a dummy is as follows:

First:  A value of zero is often qualitatively different from any
positive value.  In the example that follows, I describe the
relationship between troponin I and mortality.  Troponin, like any
laboratory test, has a detection limit below which any concentration
will be reported as zero.  In my example, that detection limit is .01
mcg/L.  So a value reported by the laboratory analyzer as zero might
represent actual concentrations of .001 mcg/L, .0001 mcg/L,
or "truly zero."  Most likely it represents an admixture of all of
these.    The choice of a single value to represent zero would be
arbitrary, and different choices can give markedly different results
if the regression is fit with only one continuous variable.  Another
example, given by Hosmer & Lemeshow, is the effect of number of
cigarettes per day on a particular health outcome.  There is good
reason to believe that the condition of nonsmoking does not fit
anywhere on a continuous scale that is a function of the number of
cigarettes smoked per day.

Second:  The fact that log(0) does not exist is not simply a
mathematical nuisance.   If an outcome relates to the logarithm of
another variable, the meaning is that the outcome varies with
order-of-magnitude changes.  For example, for troponin I, the odds of
mortality approximately double with each 10-fold increase in positive
values.   What can this mean with regard to a troponin of zero?  Well,
if it were truly zero, then an order-of-magnitude change is
meaningless, and the case of zero would need to be dealt with
separately.

The solution is to use two variables: one to represent log10(troponin)
for positive values (set to 0 when troponin==0) and a dummy variable to
represent the case of troponin==0.
It is important to formulate the dummy variable as zero for cases of
troponin==0 and 1 otherwise, and not the other way around.

When I mentioned this before, Nick seemed to trip over the replacement
of log(troponin) with zero if troponin==0, since log(1)=0.
The formulation of the dummy variable takes care of that.
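
Written out, with b0, b1 and b2 as illustrative names for the
coefficients (they are not used in the code below), the model is

   logit(p) = b0 + b1*logtrop_pos + b2*zero_dummy

   troponin==0:  logit(p) = b0
   troponin>0:   logit(p) = b0 + b2 + b1*log10(troponin)

So troponin==0 gets its own level (b0), and troponin==1, where
log10(troponin)=0, gets b0 + b2 rather than being confused with it.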

In the following example (which can be cut-and-pasted into Stata), I
first create a simulated data set with the following characteristics:

- 20% of troponin values are zero.
- The remainder of observations are uniformly distributed between .01 and 100.
- Mortality rate for troponin==0 is 2%.
- Mortality rate for troponin==.01 is 3%, and the odds of mortality double
  for every 10-fold increase.

I then perform a logistic regression of the simulated data using the
dummy variable formulation and compare predicted probabilities to the
parameters used to create the data.

Apologies in advance if I am missing Nick's previous point, or if this
seems absurdly obvious.
If anybody knows of a more "standard" way to do this, please let me know.

Dan

**********************example************************

/*

First, make the simulated data

*/

clear
set seed 12234
set obs 100000

// set 20% of troponin values to zero

gen troponin=0 if uniform()<=.2

// set rest of the values to between .01 and 100
// (a rare draw that rounds to 0 simply joins the troponin==0 group)

replace troponin=round(uniform()*100,.01) if troponin==.

// 2% mortality if troponin==0

gen died=uniform()<=.02 if troponin==0
gen p_design = .02 if troponin==0

// 3% mortality for lowest nonzero value (troponin==.01)
// with a doubling of the odds for each 10-fold increase
//
// note that since log10(1)=0, the constant term below is the log odds of
// mortality for troponin==1, i.e. two doublings (a factor of 4) above the
// odds at troponin==.01
//

local baseline_odds=.03/(1-.03)

replace p_design=invlogit(log(2)*log10(troponin) /*
*/ + log(`baseline_odds'*4)) if troponin>0

replace died=uniform()<=p_design if troponin>0
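
// Added sketch (an illustration, not a required step): list the design
// probabilities implied by the rule above, assuming the local macro
// baseline_odds defined earlier is still in scope.
// By hand these come to about .03, .058, .110, .198 and .331.

foreach t in .01 .1 1 10 100 {
	display %6.2f `t' "   " %6.4f invlogit(log(2)*log10(`t') + log(`baseline_odds'*4))
}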

/*

Now perform logistic regression for mortality vs. log10(troponin)
on the simulated data,
using the variable <zero_dummy>=0 if troponin==0, 1 otherwise,
and <logtrop_pos>=log10(troponin) if troponin>0, zero otherwise

*/

gen logtrop_pos=cond(troponin==0,0,log10(troponin))
gen zero_dummy=cond(troponin==0,0,1)
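
// Added sketch (optional): confirm that the two variables are coded as
// described above; -assert- stops with an error if any observation
// violates the stated condition.

assert zero_dummy==0 & logtrop_pos==0 if troponin==0
assert zero_dummy==1 if troponin>0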

logistic died logtrop_pos zero_dummy
predict p_fitted
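
// Added sketch: read the design quantities back from the fitted
// coefficients. After -logistic-, _b[] holds coefficients on the
// log-odds scale. If the fit recovers the design, the three values
// should be close to .02, about .11, and 2 respectively.

display "P(died | troponin==0)   = " invlogit(_b[_cons])
display "P(died | troponin==1)   = " invlogit(_b[_cons] + _b[zero_dummy])
display "OR per 10-fold increase = " exp(_b[logtrop_pos])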

// compare results of fitted values to the initial parameters

su p_fitted p_design if troponin==0
su p_fitted p_design if troponin==float(.01)
su p_fitted p_design if troponin==float(.1)
su p_fitted p_design if troponin==1
su p_fitted p_design if troponin==10

************end example**********

```