
From: "Joseph Coveney" <jcoveney@bigplanet.com>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: Re: Modeling an independent variable with a very high data density at x=0
Date: Sat, 6 Jun 2009 16:07:05 +0900

Allan Garland wrote:

I'm doing a logistic regression using a non-negative, continuous independent variable X, for which about 60% of cases have X=0. It seems to me that just including X in the model is problematic, since it is likely that many cases with Y=0 and many others with Y=1 will have X=0.

I can think of 2 possible approaches to modeling X, but would like some feedback on them, and any other thoughts on how to handle this situation.

a) Divide X into m categories and represent it with m-1 dummy variables in the model.

b) Include X in the model, and also include a binary variable Z such that Z=1 when X=0 and Z=0 otherwise. Then the effect of X=0 is given by the coefficient of Z, and the effect of X>0 is purely given by the coefficient of X itself (since then Z=0).

--------------------------------------------------------------------------------

I would base the model as much as I could on subject matter or scientific interpretability. But in the absence of any guidance from those, here's one general approach to look at it (below). The general approach can be specifically developed, with resampling/cross-validation, examination of calibration curves, and so on. Frank E. Harrell, Jr., has written a good resource for approaches to modeling, _Regression Modeling Strategies_ (NY: Springer-Verlag, 2001).

(-genbinomial- is user-written, and so you'll need to install it if you haven't already. -findit genbinomial- will point you to its location.)

Joseph Coveney

clear *
set more off
set seed `=date("2009-06-06", "YMD")'
set obs 250
generate double x = runiform()
replace x = 0 in 1/`=0.6 * 250'
generate double xb = -1 + 0.5 * x
set seed0 `=date("2009-06-06", "YMD")'
genbinomial y, xbeta(xb) n(1) link(logit)

/*
    a) Divide X into m categories and represent it with m-1 dummy
       variables in the model.
*/
egen byte x_categorized = cut(x), at(0(0.2)1)
tabulate x_categorized, generate(x_)
logistic y x_2-x_5
estimates store a

/*
    Include X in the model, and also include a binary variable Z
    such that Z=1 when X=0 and Z=0 otherwise.
*/
generate byte z = x == 0
logistic y x z
estimates store b
lrtest a b, stats
* b wins

/*
    A third option.  One that you might have rejected out-of-hand.
*/
logistic y x
estimates store c
lrtest b c, stats
* c wins

exit

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
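
For readers who would rather not install -genbinomial-, the comparison of options b and c above could be sketched with built-in functions only. This is not part of the original post: drawing y from runiform() and invlogit(), the fixed seed value, and the use of -estimates stats- to tabulate AIC and BIC side by side are all assumptions made for illustration, and the simulated data will differ from those of the original code.

```stata
* Minimal sketch, not the original author's code: same design
* (250 observations, 60% with x == 0, true logit = -1 + 0.5*x),
* but y is drawn with built-in functions instead of -genbinomial-.
clear
set seed 18054                    // arbitrary seed, chosen for illustration
set obs 250
generate double x = runiform()
replace x = 0 in 1/150            // 60% of 250 observations set to zero
generate byte y = runiform() < invlogit(-1 + 0.5 * x)

* Option b: x plus an indicator for x == 0
generate byte z = (x == 0)
logistic y x z
estimates store b

* Option c: x alone
logistic y x
estimates store c

* AIC and BIC for both fits in one table
estimates stats b c
```

Because the data are simulated with no extra effect at x = 0, the simpler model would usually be expected to come out ahead on the information criteria, although any single random draw can go the other way.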

**References**: **st: Modeling an independent variable with a very high data density at x=0** *From:* Allan Garland <agar5858@shaw.ca>
