# Re: st: logistic regression predictors

From: Steven Samuels
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: logistic regression predictors
Date: Mon, 19 Jul 2010 11:19:14 -0400

In fact, if you feed the two-time data to -cart-, as I suggested, the log-rank test in -cart- (which is -stcox- with the breslow option for ties) will be equivalent to the stratified Mantel-Haenszel test for binary data. Thus -cart- will provide a defensible split for binary data. This splitting algorithm is not equivalent to that in the original CART method; also, -cart- does not prune its trees and so risks over-splitting. It continues to split if there are enough events, if there is a split with enough observations on each side, and if a p-value adjusted for multiple comparisons is small enough. The minimum required numbers of events and observations are set by the minfail() and minsize() options; the default values are 10. The default p-value threshold is 0.05, which is also settable.
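
The stopping rule described above can be written out as a small predicate. This is only a minimal sketch in Python of that description, not -cart-'s actual code: the function name `should_split` and its argument names are hypothetical, the defaults mirror the stated defaults of 10 and 0.05, and whether the real comparisons are strict or inclusive is an assumption.

```python
def should_split(n_events, n_left, n_right, adj_p,
                 minfail=10, minsize=10, alpha=0.05):
    """Return True if a node would be split under the rule described above.

    n_events : number of events (deaths) in the node
    n_left, n_right : observations on each side of the candidate split
    adj_p : p-value for the split, already adjusted for multiple comparisons
    """
    enough_events = n_events >= minfail          # assumed inclusive
    enough_each_side = n_left >= minsize and n_right >= minsize
    significant = adj_p < alpha                  # assumed strict
    return enough_events and enough_each_side and significant
```

For instance, a node with 25 events and a candidate split of 40 versus 60 observations would be split at an adjusted p-value of 0.01, but not if one side held only 4 observations.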
To calculate the error rate, you have to identify the observations in each final node; classify each observation according to whether the proportion of events in its node is >.5 or <.5; then compute the percent of correct classifications overall (and for each node if you wish, although the per-node figures will not be very precise).
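
The calculation above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not output from -cart-: the function name `classification_summary` is hypothetical, `nodes` is assumed to be a list of terminal-node labels you have assigned to each observation, `died` is a matching list of 0/1 outcomes, and a node whose event proportion is exactly .5 is arbitrarily classified as "survive".

```python
from collections import defaultdict

def classification_summary(nodes, died):
    """Percent correctly classified, overall and per terminal node.

    Each observation is predicted to die if the proportion of events
    in its node exceeds 0.5, and to survive otherwise.
    """
    tally = defaultdict(lambda: [0, 0])   # node -> [observations, events]
    for node, y in zip(nodes, died):
        tally[node][0] += 1
        tally[node][1] += y

    per_node = {}
    total_correct = 0
    for node, (n, events) in tally.items():
        predict_death = events / n > 0.5
        # Correct cases are the events if we predict death, else the non-events.
        correct = events if predict_death else n - events
        per_node[node] = 100.0 * correct / n
        total_correct += correct

    overall = 100.0 * total_correct / len(nodes)
    return overall, per_node
```

The error rate is then 100 minus the overall percentage. Because the same data formed the tree and scored it, the figure is optimistic.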

Steve

On Jul 18, 2010, at 11:59 AM, Steve Samuels wrote:

I was wrong about the utility of -cart- and -boost- for your data.
-boost- is not useful when the predictors are indicator variables, as
yours seem to be. (You haven't given many details.) -cart- is intended
for failure-time data, not binary data.

With a small number of predictors, you might be able to do a
classification tree "by hand". -cart- might guide you to a possible
tree: simply set up two times, a shorter one for deaths and a longer
one for survivors. -cart- will show the numbers of cases and failures
at each terminal node. The error rate will be optimistic, because
it is measured on the same data used to form the tree. To get a more
accurate error rate, you could also manually do a cross-validation.
Most simply, randomly split your data into "training" and "test"
sets. Develop your tree on the training set, and estimate its
accuracy (percent correctly predicted) on your test set. This can
be improved by k-fold cross-validation: randomly divide your data into
k (say 10) sets, omit one at a time, do -cart- on the remainder, and
test the resulting prediction on the omitted set. Your estimate of
prediction error is the average of the k estimates.

I also suggest that you look at the counts, deaths, and rates for
all combinations of your predictors. See -crp- by Nick Cox.

Steve

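
The k-fold procedure described in the message above can be sketched as follows. This is a minimal Python illustration under loud assumptions, not a replacement for running -cart-: in place of a fitted tree it uses a hypothetical stand-in classifier, `fit_cell_majority`, which predicts death for a combination of indicator predictors when more than half of the training cases with that combination died. The function names, the `seed` argument, and the fallback for combinations unseen in training are all illustrative choices.

```python
import random
from collections import defaultdict

def fit_cell_majority(rows):
    """Stand-in for tree fitting. rows is a list of (predictors, died) pairs,
    where predictors is a tuple of 0/1 indicators and died is 0 or 1."""
    counts = defaultdict(lambda: [0, 0])   # combination -> [cases, deaths]
    for x, y in rows:
        counts[x][0] += 1
        counts[x][1] += y
    rule = {x: (1 if d / n > 0.5 else 0) for x, (n, d) in counts.items()}
    # Fallback for combinations absent from the training folds:
    # predict the majority outcome of the whole training set.
    deaths = sum(y for _, y in rows)
    fallback = 1 if deaths / len(rows) > 0.5 else 0
    return rule, fallback

def kfold_error(rows, k=10, seed=12345):
    """Average misclassification rate over k randomly formed folds."""
    rows = list(rows)
    if k < 2 or len(rows) < k:
        raise ValueError("need k >= 2 and at least k observations")
    random.Random(seed).shuffle(rows)          # random division into folds
    folds = [rows[i::k] for i in range(k)]
    errors = []
    for i, test in enumerate(folds):
        # Fit on every fold except the omitted one.
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        rule, fallback = fit_cell_majority(train)
        wrong = sum(rule.get(x, fallback) != y for x, y in test)
        errors.append(wrong / len(test))
    return sum(errors) / k
```

Each row would be a pair such as `((1, 0, 0), 1)` for a client with the first condition only who died. Swapping the stand-in for a real tree-fitting step leaves the fold logic unchanged.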
On Sun, Jul 18, 2010 at 10:24 AM, Steve Samuels <sjsamuels@gmail.com> wrote:

With such a strong independently predictive group, logistic regression
will give poor predictions, because it assumes that all variables are
needed to predict for each individual. The solution is a tree-based
approach. The original reference is Breiman, L., J. H. Friedman, R. A.
Olshen, and C. J. Stone. 1984. Classification and Regression Trees. New
York: Chapman & Hall/CRC. Apparent Stata solutions are -boost-
("findit boost") and -cart- (from SSC). I say "apparent", because I've
not closely read the documentation for either. Non-commercial
solutions can be found in R and at
http://www.stat.wisc.edu/~loh/guide.html.

Steve

--
Steven Samuels
sjsamuels@gmail.com
18 Cantine's Island
Saugerties NY 12477
USA
Voice: 845-246-0774
Fax:    206-202-4783

On Sun, Jul 18, 2010 at 1:57 AM, lilian tesmann <lilian_tes@hotmail.com> wrote:

Dear All,

I am trying to predict mortality rates in a specific population of clients. I have encountered two problems and would be really grateful for any insights or suggestions.

(1) We have one predictor, a health condition, which is present in only 5% of the population, but over 70% of people with that condition die. Not surprisingly, the OR is very large (from 25 to 50). The purpose of the analysis is to obtain individual predictions, but they are hugely influenced by this health condition. Could anyone suggest how to deal with this problem?

(2) Another problem is that, in this very specific clinical population, another two health conditions, which are usually very significant predictors of death, have OR = 0.3-0.5. The effect on prediction is that, according to my model, sicker people have a lower risk of dying. It looks to me like a collinearity issue between the predictors and the inclusion/exclusion criteria that created this population. What do I do in this situation? We cannot change the inclusion criteria, and we have only a small number of predictors, three of them with ‘behavior problems’.
```
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```