
Re: st: logistic regression predictors

From   Steve Samuels <>
Subject   Re: st: logistic regression predictors
Date   Sun, 18 Jul 2010 11:59:30 -0400

I was wrong about the utility of -cart- and -boost- for your data.
-boost- is not useful when the predictors are indicator variables, as
yours seem to be (you haven't given many details). -cart- is intended
for failure-time data, not binary data.

With a small number of predictors, you might be able to build a
classification tree "by hand". -cart- might guide you to a possible
tree: simply set up two artificial failure times, a shorter one for
deaths and a longer one for survivors. -cart- will then show the
numbers of cases and failures at each terminal node. The error rate
will be optimistic, because it is measured on the same data used to
form the tree. To get a more accurate error rate, you could do a
cross-validation manually. Most simply, randomly split your data into
a "training" set and a "test" set. Develop your tree on the training
set, and estimate its accuracy (percent correctly predicted) on the
test set. This can be improved by k-fold cross-validation: randomly
divide your data into k (say 10) sets, omit one at a time, run -cart-
on the remainder, and test the resulting prediction on the omitted
set. Your estimate of prediction error is the average over the k folds.
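The k-fold scheme above can be sketched in a few lines of plain Python (not Stata; the helper names and the toy indicator-based "tree" are illustrative assumptions, standing in for whatever classifier you actually fit):

```python
import random

def kfold_error(data, labels, fit, predict, k=10, seed=1):
    """Estimate prediction error by k-fold cross-validation.

    data/labels are parallel lists; fit(train_x, train_y) returns a
    model; predict(model, x) returns a predicted label for one case.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)          # random partition of cases
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    errors = []
    for fold in folds:
        held_out = set(fold)
        train_x = [data[i] for i in idx if i not in held_out]
        train_y = [labels[i] for i in idx if i not in held_out]
        model = fit(train_x, train_y)
        wrong = sum(predict(model, data[i]) != labels[i] for i in fold)
        errors.append(wrong / len(fold))      # error on the omitted set
    return sum(errors) / k                    # average over the k folds

# Toy stand-in for a one-split tree: predict the majority outcome
# within each level of a single 0/1 indicator.
def fit(xs, ys):
    maj = {}
    for v in (0, 1):
        sub = [y for x, y in zip(xs, ys) if x == v]
        maj[v] = int(sum(sub) * 2 >= len(sub)) if sub else 0
    return maj

def predict(model, x):
    return model[x]
```

With a perfectly predictive indicator, the cross-validated error is 0; with weaker predictors the held-out error will exceed the apparent (training-set) error, which is the point of the exercise.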

I also suggest that you look at the counts, deaths, and rates for all
combinations of your predictors. See -crp- by Nick Cox, downloadable
from SSC.
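That tabulation is easy to mimic outside Stata as well. A minimal pure-Python sketch (the function name is my own; it only approximates the kind of combination-by-combination report -crp- gives):

```python
from collections import defaultdict

def rates_by_pattern(predictors, died):
    """Tabulate count, deaths, and death rate for every observed
    combination of indicator predictors.

    predictors: list of tuples of 0/1 indicators, one per case.
    died: parallel list of 0/1 outcomes.
    Returns {pattern: (n, deaths, rate)} sorted by pattern.
    """
    tab = defaultdict(lambda: [0, 0])   # pattern -> [n, deaths]
    for row, d in zip(predictors, died):
        cell = tab[tuple(row)]
        cell[0] += 1
        cell[1] += d
    return {p: (n, dth, dth / n) for p, (n, dth) in sorted(tab.items())}
```

Scanning such a table shows immediately which predictor combinations carry the deaths, before any model is fit.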


On Sun, Jul 18, 2010 at 10:24 AM, Steve Samuels <> wrote:
> With such a strong independently predictive group, logistic regression
> will give poor predictions, because it assumes that all variables are
> needed to predict for each individual. The solution is a tree-based
> approach. The original reference is Breiman, L., J. H. Friedman, R. A.
> Olshen, and C. J. Stone. 1984. Classification and Regression Trees. New
> York: Chapman & Hall/CRC. Apparent Stata solutions are -boost-
> ("findit boost") and -cart- (from SSC). I say "apparent", because I've
> not closely read the documentation for either. Non-commercial
> solutions can be found in R and at
> Steve
> --
> Steven Samuels
> 18 Cantine's Island
> Saugerties NY 12477
> Voice: 845-246-0774
> Fax:    206-202-4783
> On Sun, Jul 18, 2010 at 1:57 AM, lilian tesmann <> wrote:
>> Dear All,
>> I am trying to predict mortality rates in a specific population of clients.
>> I encountered two problems and would be really grateful for any insights or suggestions.
>> (1) We have one predictor – a health condition, which is present in only 5% of the population, but over 70% of people with that condition die. Not surprisingly, the OR is very large (from 25 to 50). The purpose of the analysis is to obtain individual predictions, but they are hugely influenced by this health condition. Could anyone suggest how to deal with this problem?
>> (2) Another problem is that in this very specific clinical population, another two health conditions, which are usually very significant predictors of death, have OR=0.3-0.5. The effect on prediction is that, according to my model, sicker people have a lower risk of dying. It looks to me like a collinearity issue between the predictors and the inclusion/exclusion criteria that created this population. What do I do in this situation? We cannot change the inclusion criteria, and we have only a small number of predictors, three of them with ‘behavior problems’.

Steven Samuels
18 Cantine's Island
Saugerties NY 12477
Voice: 845-246-0774
Fax:    206-202-4783
