
From: lilian tesmann <lilian_tes@hotmail.com>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: logistic regression predictors
Date: Tue, 20 Jul 2010 09:47:28 +1030

Thank you Steve for the idea and useful instructions. This sounds like a promising way to go.

Lilian

> From: sjhsamuels@earthlink.net
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: logistic regression predictors
> Date: Mon, 19 Jul 2010 11:19:14 -0400
>
> --
>
> In fact, if you feed the two-time data to -cart-, as I suggested, the
> log rank test in -cart- (which is -stcox- with the breslow option for
> ties) will be equivalent to the stratified Mantel-Haenszel test for
> binary data. Thus -cart- will provide a defensible split for binary
> data. This splitting algorithm is not equivalent to that in the
> original CART method; also, -cart- does not prune its trees and so
> risks over-splitting. It continues to split as long as there are
> enough events; there is a split with enough observations on each side
> of the split; and the p-value, adjusted for multiple comparisons, is
> small enough. The minimum required numbers of events and observations
> are set by the minfail() and minsize() options; the default values are
> 10. The default p-value threshold is 0.05, and it is also settable.
>
> To calculate the error rate, you have to identify the observations in
> each final node; classify each observation according to whether the
> proportion of events in the node is >.5 or <.5; then compute the
> percent of correct classifications overall (also for each node if you
> wish, but these will not be too precise).
>
> Steve
>
>
> On Jul 18, 2010, at 11:59 AM, Steve Samuels wrote:
>
>
> I was wrong about the utility of -cart- and -boost- for your data.
> -boost- is not useful when the predictors are indicator variables, as
> yours seem to be. (You haven't given many details.) -cart- is intended
> for failure time data, not binary data.
>
> With a small number of predictors, you might be able to do a
> classification tree "by hand". -cart- might guide you to a possible
> tree: simply set up two times: a shorter one for deaths and a longer
> one for survivors.
> -cart- will show the numbers of cases and failures at each terminal
> node. The error rate will be optimistic, because it is measured on
> the same data used to form the tree. To get a more accurate error
> rate, you could also manually do a cross-validation. Most simply,
> randomly split your data into "training" and "test" sets. Develop
> your tree on the training set, and estimate its accuracy (percent
> correctly predicted) on your "test" set. This can be improved by
> k-fold cross-validation: randomly divide your data into k (say 10)
> sets, omit one at a time, do -cart- on the remainder, and test the
> resulting prediction on the omitted set. Your estimate of prediction
> error is the average of the 10.
>
> I also suggest that you look at the counts, deaths, and rates for
> all combinations of your predictors. See -crp- by Nick Cox,
> downloadable from SSC.
>
> Steve
>
> On Sun, Jul 18, 2010 at 10:24 AM, Steve Samuels <sjsamuels@gmail.com>
> wrote:
>> With such a strong independently predictive group, logistic regression
>> will give poor predictions, because it assumes that all variables are
>> needed to predict for each individual. The solution is a tree-based
>> approach. The original reference is Breiman, L., J. H. Friedman, R. A.
>> Olshen, and C. J. Stone. 1984. Classification and Regression Trees.
>> New York: Chapman & Hall/CRC. Apparent Stata solutions are -boost-
>> ("findit boost") and -cart- (from SSC). I say "apparent", because I've
>> not closely read the documentation for either. Non-commercial
>> solutions can be found in R and at
>> http://www.stat.wisc.edu/~loh/guide.html.
>>
>>
>> Steve
>>
>> --
>> Steven Samuels
>> sjsamuels@gmail.com
>> 18 Cantine's Island
>> Saugerties NY 12477
>> USA
>> Voice: 845-246-0774
>> Fax: 206-202-4783
>>
>>
>>
>> On Sun, Jul 18, 2010 at 1:57 AM, lilian tesmann <lilian_tes@hotmail.com>
>> wrote:
>>> Dear All,
>>>
>>> I am trying to predict mortality rates in a specific population of
>>> clients.
>>> I encountered two problems and would be really grateful for any
>>> insights or suggestions.
>>>
>>> (1) We have one predictor – a health condition, which is present
>>> in only 5% of the population, but over 70% of people with that
>>> condition die. Not surprisingly, the OR is very large (from 25 to
>>> 50). The purpose of the analysis is to obtain individual
>>> predictions, but they are hugely influenced by this health
>>> condition. Could anyone suggest how to deal with this problem?
>>>
>>> (2) Another problem is that in this very specific clinical
>>> population another two health conditions, which are usually very
>>> significant predictors of death, have OR=0.3-0.5. The effect on
>>> prediction is that, according to my model, sicker people have a
>>> lower risk of dying. It looks to me like a collinearity issue
>>> between predictors and our inclusion/exclusion criteria which
>>> created this population. What do I do in this situation? We cannot
>>> change inclusion criteria and we have only a small number of
>>> predictors, three of them with 'behavior problems'.
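
The error-rate and cross-validation steps Steve describes in this thread can be sketched outside Stata. What follows is a minimal Python illustration, not the -cart- algorithm. It assumes the predictors are indicator variables and treats each distinct combination of them as a stand-in for a terminal node. The data are made up for illustration, and the names `fit_nodes`, `accuracy`, and `kfold_accuracy` are hypothetical helpers. A node is classified as "death" when its proportion of events exceeds .5, and the k-fold estimate is the average accuracy over the held-out sets.

```python
import random
from collections import defaultdict

# Made-up toy data for illustration only: (indicator predictors, died).
# Assumption: each distinct predictor combination stands in for one terminal node.
rows = (
    [((1, 0), 1)] * 7 + [((1, 0), 0)] * 3
    + [((0, 0), 1)] * 5 + [((0, 0), 0)] * 45
    + [((0, 1), 1)] * 10 + [((0, 1), 0)] * 30
)

def fit_nodes(train):
    """Predict death (1) in a node when its death proportion exceeds .5."""
    tally = defaultdict(lambda: [0, 0])  # node -> [deaths, observations]
    for node, died in train:
        tally[node][0] += died
        tally[node][1] += 1
    # A proportion of exactly .5 is classified as survival (0) here.
    rule = {node: int(d / n > 0.5) for node, (d, n) in tally.items()}
    # Fallback for a node never seen in training: the overall majority class.
    deaths = sum(d for d, _ in tally.values())
    default = int(deaths / len(train) > 0.5)
    return rule, default

def accuracy(rule, default, test):
    """Proportion of observations whose outcome matches the node's prediction."""
    hits = sum(rule.get(node, default) == died for node, died in test)
    return hits / len(test)

def kfold_accuracy(data, k=10, seed=1):
    """Average held-out accuracy over k randomly assigned folds."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        # Fit on every fold except the one being held out.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        rule, default = fit_nodes(train)
        scores.append(accuracy(rule, default, test))
    return sum(scores) / k

rule, default = fit_nodes(rows)
apparent = accuracy(rule, default, rows)  # measured on the data that built the rule
cross_validated = kfold_accuracy(rows, k=10)
print(f"apparent accuracy: {apparent:.3f}")
print(f"10-fold accuracy:  {cross_validated:.3f}")
```

The error rate is one minus the accuracy. The apparent figure is the optimistic one, because it is computed on the same observations that defined the node predictions; the k-fold figure uses only held-out observations.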

