# st: Re: Modeling an independent variable with a very high data density at x=0

From: "Joseph Coveney"
Subject: st: Re: Modeling an independent variable with a very high data density at x=0
Date: Sat, 6 Jun 2009 16:07:05 +0900

```Allan Garland wrote:

I'm doing a logistic regression using a non-negative, continuous independent
variable X, for which about 60% of cases have X=0.  It seems to me that just
including X in the model is problematic, since it is likely that many cases with
Y=0 and many others with Y=1 will have X=0.  I can think of 2 possible
approaches to modeling X, but would like some feedback on them, and any other
thoughts on how to handle this situation.
a) Divide X into m categories and represent it with m-1 dummy variables in the
model.
b) Include X in the model, and also include a binary variable Z such that Z=1
when X=0 and Z=0 otherwise.  Then the effect of X=0 is given by the coefficient
of Z, and the effect of X>0 is purely given by the
coefficient of X itself (since then Z=0).

--------------------------------------------------------------------------------

I would base the model as much as I could on subject matter or scientific
interpretability.  But in the absence of any guidance from those, here's one
general way to look at it (below).  The approach can be developed further with
resampling/cross-validation, examination of calibration curves, and so on.
Frank E. Harrell, Jr., has written a good resource on modeling strategy,
_Regression Modeling Strategies_ (NY: Springer-Verlag, 2001).

(-genbinomial- is user-written, and so you'll need to install it if you haven't
already.  -findit genbinomial- will point you to its location.)

Joseph Coveney

clear *
set more off

set seed `=date("2009-06-06", "YMD")'
set obs 250

generate double x = runiform()
replace x = 0 in 1/`=0.6 * 250'
generate double xb = -1 + 0.5 * x

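* y is never created in the listing as posted.  The note above implies it was
* drawn from xb with -genbinomial-; the line below is a built-in stand-in (an
* assumption, not the original call), so exact estimates may differ from the
* original run.  It draws y ~ Bernoulli(invlogit(xb)).
generate byte y = rbinomial(1, invlogit(xb))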

/*  a) Divide X into m categories and represent it with m-1 dummy variables in
the model. */
egen byte x_categorized = cut(x), at(0(0.2)1)
tabulate x_categorized, generate(x_)
logistic y x_2-x_5
estimates store a

/*  b) Include X in the model, and also include a binary variable Z such that
Z=1 when X=0 and Z=0 otherwise. */
generate byte z = x == 0
logistic y x z
estimates store b

lrtest a b, stats
* b wins

/* A third option.  One that you might have rejected out-of-hand. */
logistic y x
estimates store c

lrtest b c, stats
* c wins

exit

```
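The reply mentions resampling/cross-validation as one way to develop the comparison further. A minimal sketch of that idea follows, assuming Stata 10.1 or later (for `runiform()` and `rbinomial()`). It compares specifications (b) and (c) by their held-out log-likelihood under 5-fold cross-validation. The fold count, the seed, and the variable names `u`, `fold`, `p`, `ll_b`, and `ll_c` are illustrative assumptions and not part of the original post, and `y` is drawn with built-in functions in place of -genbinomial-.

```stata
clear all
set seed 20090606
set obs 250

* Simulated data in the spirit of the listing above: 60% of x is exactly 0
generate double x = runiform()
replace x = 0 in 1/150
generate byte y = rbinomial(1, invlogit(-1 + 0.5 * x))
generate byte z = x == 0

* Randomly assign each observation to one of 5 equally sized folds
generate double u = runiform()
sort u
generate byte fold = mod(_n, 5) + 1

* Held-out log-likelihood contribution of each observation, per specification
generate double ll_b = .
generate double ll_c = .

forvalues k = 1/5 {
    * (b) x plus the zero indicator z: fit on the other folds, predict fold k
    quietly logit y x z if fold != `k'
    quietly predict double p if fold == `k', pr
    quietly replace ll_b = y * ln(p) + (1 - y) * ln(1 - p) if fold == `k'
    drop p

    * (c) x alone: fit on the other folds, predict fold k
    quietly logit y x if fold != `k'
    quietly predict double p if fold == `k', pr
    quietly replace ll_c = y * ln(p) + (1 - y) * ln(1 - p) if fold == `k'
    drop p
}

* Larger (less negative) sums indicate better out-of-sample fit
tabstat ll_b ll_c, statistics(sum mean)
```

Because each observation is scored by a model that never saw it, the extra parameter in (b) earns no automatic advantage, unlike an in-sample log-likelihood. Repeating the split under several seeds would show how stable the ranking is.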