
From: "Daniel Waxman" <dan@amplecat.com>
To: statalist@hsphsun2.harvard.edu
Subject: st: log-transformation of an independent variable in logistic regression: What to do with the zeroes
Date: Thu, 1 Nov 2007 12:11:00 -0400

Statalist,

This is a follow-up to a side discussion that came up in another thread last week. My suggested method of using a dummy variable to represent zero values of the untransformed independent variable was met with some skepticism, and I had promised an example/demonstration…

The rationale for using a dummy is as follows:

First: A value of zero is often qualitatively different from any positive value. In the example that follows, I describe the relationship between troponin I and mortality. Troponin, like any laboratory test, has a detection limit below which any concentration will be reported as zero. In my example, that detection limit is .01 mcg/L. So a value reported by the laboratory analyzer as zero might represent an actual concentration of .001 mcg/L, .0001 mcg/L, or "truly zero." Most likely it represents an admixture of all of these. The choice of a single value to represent zero would be arbitrary, and different choices can give markedly different results if the regression is fit with only one continuous variable.

Another example, given by Hosmer & Lemeshow, is the effect of the number of cigarettes smoked per day on a particular health outcome. There is good reason to believe that the condition of nonsmoking does not fit anywhere on a continuous scale that is a function of the number of cigarettes smoked per day.

Second: The fact that log(0) does not exist is not simply a mathematical nuisance. If an outcome relates to the logarithm of another variable, the meaning is that the outcome varies with order-of-magnitude changes. For example, for troponin I, the odds of mortality approximately double with each 10-fold increase among positive values. What can this mean with regard to a troponin of zero? Well, if it were truly zero, then an order-of-magnitude change is meaningless, and the case of zero would need to be dealt with separately.
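To make the "doubling with each 10-fold increase" concrete, here is a small arithmetic sketch in Stata. It is only an illustration, not part of the demonstration itself: it borrows the 3% mortality at troponin==.01 from the simulation described later in this post as its starting point, and the loop values are simply powers of ten chosen for display.

```stata
* Sketch: odds that double with every 10-fold increase in troponin.
* Starting point (assumed, taken from the simulation in this post):
* 3% mortality at the detection limit of .01 mcg/L.
local baseline_odds = .03/(1-.03)
foreach t in .01 .1 1 10 100 {
    * log10(`t'/.01) counts the 10-fold steps above the detection limit
    local odds = `baseline_odds' * 2^log10(`t'/.01)
    display "troponin = `t'" _col(20) "odds = " %6.4f `odds' _col(38) "p = " %6.4f `odds'/(1+`odds')
}
* log10(0) evaluates to missing in Stata, so a zero has no place on this ladder
display log10(0)
```

Each step multiplies the odds by 2, so the probabilities come out at roughly 3.0%, 5.8%, 11.0%, 19.8%, and 33.1%. The last line displays a missing value (.), which is the practical form of the problem: a troponin of zero cannot be reached by any number of 10-fold steps.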
The solution is to use two variables: one to represent log10(troponin) for positive values, and a dummy variable to represent the case of troponin==0. It is important to formulate the dummy variable as zero for cases of troponin==0 and 1 otherwise, and not the other way around. When I mentioned this before, Nick seemed to trip over the replacement of log(troponin) with zero if troponin==0, since log(1)=0. The formulation of the dummy variable takes care of that.

In the following example (which can be cut-and-pasted into Stata), I first create a simulated data set with the following characteristics:

- 20% of troponin values are zero.
- The remaining observations are uniformly distributed between .01 and 100.
- The mortality rate for troponin==0 is 2%.
- The mortality rate for troponin==.01 is 3%, and the odds double for every 10-fold increase.

I then perform a logistic regression of the simulated data using the dummy variable formulation and compare the predicted probabilities to the parameters used to create the data.

Apologies in advance if I am missing Nick's previous point, or if this seems absurdly obvious. If anybody knows of a more "standard" way to do this, please let me know.

Dan

**********************example************************
/* First, make the simulated data */
clear
set seed 12234
set obs 100000

// set 20% of troponin values to zero
gen troponin=0 if uniform()<=.2

// set rest of the values to between .01 and 100
// (draws below .005 round to 0 and simply join the zero group)
replace troponin=round(uniform()*100,.01) if troponin==.

// 2% mortality if troponin==0
gen died=uniform()<=.02 if troponin==0
gen p_design = .02 if troponin==0

// 3% mortality for lowest nonzero value (troponin==.01)
// with a doubling of the odds for each 10-fold increase
//
// note that since log(1)=0, the intercept represents the odds of mortality
// for troponin==1, which is two 10-fold steps above .01, hence the factor of 4
//
local baseline_odds=.03/(1-.03)
replace p_design=invlogit(log(2)*log10(troponin) /*
*/ + log(`baseline_odds'*4)) if troponin>0
replace died=uniform()<=p_design if troponin>0

/* Now perform logistic regression for mortality vs. log10-troponin on the
   simulated data, using the variable <zero_dummy>=0 if troponin==0 (1 otherwise)
   and <logtrop_pos>=log10(troponin) if troponin>0, zero otherwise */
gen logtrop_pos=cond(troponin==0,0,log10(troponin))
gen zero_dummy=cond(troponin==0,0,1)
logistic died logtrop_pos zero_dummy
predict p_fitted

// compare results of fitted values to the initial parameters
su p_fitted p_design if troponin==0
su p_fitted p_design if troponin==float(.01)
su p_fitted p_design if troponin==float(.1)
su p_fitted p_design if troponin==1
su p_fitted p_design if troponin==10
************end example**********
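As a possible follow-on check of the claim that a single substitute value for zero is arbitrary, the sketch below refits the model with the zeros replaced by a small constant and compares the slope with the one from the dummy-variable formulation. It rebuilds the same simulated design in compressed form so that it runs on its own. The stand-in values .005 and .00001 and the variable name logtrop_sub are illustrative assumptions, not anything taken from the post above.

```stata
* Sketch only: compare "substitute a constant for zero" with the dummy formulation.
* The stand-in constants .005 and .00001 are hypothetical choices for illustration.
clear
set seed 12234
set obs 100000

* same design as the example above, generated in one pass
gen troponin = 0 if uniform() <= .2
replace troponin = round(uniform()*100, .01) if troponin == .
local baseline_odds = .03/(1-.03)
gen p_design = cond(troponin == 0, .02, invlogit(log(2)*log10(troponin) + log(`baseline_odds'*4)))
gen died = uniform() <= p_design

* single continuous variable, zeros replaced by an arbitrary constant
foreach c in .005 .00001 {
    gen logtrop_sub = log10(cond(troponin == 0, `c', troponin))
    quietly logit died logtrop_sub
    display "zeros replaced by `c': odds ratio per 10-fold increase = " %5.3f exp(_b[logtrop_sub])
    drop logtrop_sub
}

* dummy-variable formulation: zero_dummy is 0 for troponin==0 and 1 otherwise
gen logtrop_pos = cond(troponin == 0, 0, log10(troponin))
gen zero_dummy = troponin > 0
quietly logit died logtrop_pos zero_dummy
display "dummy model: odds ratio per 10-fold increase = " %5.3f exp(_b[logtrop_pos])
display "dummy model: fitted p at troponin==0 = " %6.4f invlogit(_b[_cons])
display "dummy model: fitted p at troponin==1 = " %6.4f invlogit(_b[_cons] + _b[zero_dummy])
```

In the dummy formulation the constant carries the log odds for troponin==0, and the constant plus the zero_dummy coefficient gives the log odds at troponin==1, so the last three lines can be read directly against the design values (an odds ratio of 2, 2% mortality at zero, and about 11% at troponin==1). How far the substituted-constant fits drift from an odds ratio of 2 depends on the constant chosen, which is the arbitrariness at issue.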
