
From |
n j cox <n.j.cox@durham.ac.uk> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: log-transformation of an independent variable in logistic regression: What to do with the zeroes |

Date |
Thu, 01 Nov 2007 16:59:08 +0000 |

The original discussion was in a thread started by Rosy Reynolds,

<http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0710/Subject/article-915.html>

Daniel changed the subject to this other topic -- also very interesting,

but as far as I can see only connected by a common problem of zeros and

a log scale. Thanks for the detailed follow-up.

I had difficulty with Daniel's contribution because I couldn't see that

it related to Rosy's model, which was quite different in form, and

I still don't.

As said earlier, I appreciate that the mapping cond(x == 0, 0, log(x)) is not as problematic as it looks given the use of a dummy for 0, but

that "given" still seems vital to me.

Nick

n.j.cox@durham.ac.uk

Daniel Waxman

--------------------------------------------------------------------------------

This is a follow up to a side-discussion that came up in another

thread last week. My suggested method for using a dummy variable to

represent zero values of the untransformed independent variable was

met with some skepticism, and I had promised an example/demonstration…

The rationale for using a dummy is as follows:

First: A value of zero is often qualitatively different from any

positive value. In the example that follows, I describe the

relationship between troponin I and mortality. Troponin, like any

laboratory test, has a detection limit below which any concentration

will be reported as zero. In my example, that detection limit is .01

mcg/L. So a value reported by the laboratory analyzer as zero might

in fact represent concentrations of .001 mcg/L, .0001 mcg/L,

or "truly zero." Most likely it represents an admixture of all of

these. The choice of a single value to represent zero would be

arbitrary, and different choices can give markedly different results

if the regression is fit with only one continuous variable. Another

example, given by Hosmer & Lemeshow, is the effect of number of

cigarettes per day on a particular health outcome. There is good

reason to believe that the condition of nonsmoking does not fit

anywhere on a continuous scale that is a function of the number of

cigarettes smoked per day.

Second: The fact that the log(0) does not exist is not simply a

mathematical nuisance. If an outcome relates to the logarithm of

another variable, the meaning is that the outcome varies with

order-of-magnitude changes. For example, for troponin I, the odds of

mortality approximately double with each 10-fold increase in positive

values. What can this mean with regard to a troponin of zero? Well,

if it were truly zero, then an order of magnitude change is

meaningless, and the case of zero would need to be dealt with

separately.
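As a short worked version of the order-of-magnitude statement, assume a model that is linear in log10 of a positive value x. The symbols p, x, beta_0 and beta_1 are notation introduced here for illustration only.

```latex
\begin{align*}
\operatorname{logit}(p) &= \beta_0 + \beta_1 \log_{10}(x), \qquad x > 0 \\
\frac{\operatorname{odds}(10x)}{\operatorname{odds}(x)}
  &= \exp\bigl\{\beta_1\,[\log_{10}(10x) - \log_{10}(x)]\bigr\} = e^{\beta_1} \\
e^{\beta_1} = 2 \;&\Longleftrightarrow\; \beta_1 = \ln 2 \approx 0.693
\end{align*}
```

The ratio is defined only for x > 0, which is why a value of exactly zero has no place on this scale.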

The solution is to use two variables: one to represent log10(troponin)

for positive values and a dummy variable to represent the case of

troponin==0.

It is important to formulate the dummy variable as zero for cases of

troponin==0 and 1 otherwise, and not the other way around.

When I mentioned this before, Nick seemed to trip over the replacement

of log(troponin) with zero if troponin==0, since log(1)=0.

The formulation of the dummy variable takes care of that.
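To make the coding concrete, here is a minimal sketch on a hypothetical three-row data set. The variable names logtrop and positive are illustrative assumptions, not names taken from the original example.

```stata
* Hypothetical data: troponin of 0, 1 and 10 mcg/L
clear
input troponin
0
1
10
end

* log10 for positive values; zeros get 0 as a placeholder
generate double logtrop = cond(troponin == 0, 0, log10(troponin))

* dummy coded as described in the text: 0 if troponin==0, 1 otherwise
generate byte positive = troponin > 0

list troponin logtrop positive
```

The rows with troponin of 0 and 1 both have logtrop equal to 0, but positive tells them apart. In a model of the form logit(p) = b0 + b1*positive + b2*logtrop, the linear predictor is b0 when troponin==0 and b0 + b1 + b2*log10(troponin) otherwise, so the placeholder is never treated as a genuine log10 of 1.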

In the following example (which can be cut-and-pasted into Stata), I

first create a simulated data set with the following characteristics:

- 20% of troponin values are zero.

- The remainder of observations are uniformly distributed between .01 and 100.

- Mortality rate for troponin==0 is 2%.

- Mortality rate for troponin==.01 is 3% and doubles for every 10-fold increase.
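The mortality rates implied by the last two characteristics can be tabulated directly. This is a small arithmetic sketch; the formula .03*2^(log10(t)+2) is one way of writing "3% at .01, doubling for every 10-fold increase" and is an assumption about how the rule would be coded.

```stata
* Implied mortality at each 10-fold step of troponin
foreach t of numlist .01 .1 1 10 100 {
    display %8.2f `t' "   " %5.3f (.03*2^(log10(`t')+2))
}
```

This gives 3%, 6%, 12%, 24% and 48% at .01, .1, 1, 10 and 100 mcg/L. Note that doubling the probability is not the same as doubling the odds; the two are close when mortality is low and diverge as it rises, so a model that is linear on the logit scale can only approximate this design at the upper end.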

I then perform a logistic regression of the simulated data using the

dummy variable formulation and compare predicted probabilities to the

parameters used to create the data.
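A minimal sketch of such a simulation follows. It is not the original code from this post. The variable names, the seed, the sample size and the choice to draw positive values uniformly on the log10 scale are all assumptions (the description does not state the scale). It uses runiform(), which is available in current Stata; older releases used uniform().

```stata
clear
set seed 20071101
set obs 100000

* 20% zeros; remainder between .01 and 100
* ASSUMPTION: uniform on the log10 scale, so each 10-fold band is equally represented
generate double troponin = cond(runiform() < .2, 0, 10^(-2 + 4*runiform()))

* Two-variable coding: dummy is 0 for troponin==0, 1 otherwise
generate byte positive = troponin > 0
generate double logtrop = cond(troponin == 0, 0, log10(troponin))

* True mortality: 2% at zero; 3% at .01, doubling per 10-fold increase
generate double ptrue = cond(troponin == 0, .02, .03*2^(logtrop + 2))
generate byte died = runiform() < ptrue

logit died positive logtrop

* Fitted probability for troponin==0 (true value is .02)
display "zero       true 0.020   fitted " %5.3f invlogit(_b[_cons])

* Fitted versus true probability at each 10-fold step
foreach t of numlist .01 .1 1 10 100 {
    local ptrue = .03*2^(log10(`t') + 2)
    local pfit = invlogit(_b[_cons] + _b[positive] + _b[logtrop]*log10(`t'))
    display %8.2f `t' "   true " %5.3f `ptrue' "   fitted " %5.3f `pfit'
}
```

Because the dummy gives the zero group its own parameter, the fitted probability for troponin==0 equals the observed death proportion in that group and does not depend on the placeholder value used for logtrop.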

Apologies in advance if I am missing Nick's previous point, or if this

seems absurdly obvious.

If anybody knows of a more "standard" way to do this, please let me know.
