
From: jverkuilen <jverkuilen@gc.cuny.edu>

To: <statalist@hsphsun2.harvard.edu>

Subject: RE: st: time efficient way to choose variables

Date: Wed, 4 Feb 2009 09:12:42 -0500

As others have noted, this is a variant of the long-discredited stepwise regression. There are better automatic variable selection procedures, developed by the machine learning people, that go under colorful names like bagging and boosting. These all use some kind of cross-validation or bootstrapping to protect against the capitalization on chance to which older stepwise procedures are very susceptible. I don't think they are implemented in Stata, but maybe someone has written them. See, e.g., T Hastie, R Tibshirani, J Friedman. 2001. The Elements of Statistical Learning. Springer.

Model averaging is another approach. It pools predictions from models using weights derived from goodness-of-fit measures, again protecting against capitalization on chance by using bootstrapping of some sort. See, e.g., KP Burnham and DR Anderson. 2002. Model Selection and Multimodel Inference, 2nd Ed. Springer.

-----Original Message-----
From: "Hardy, Dale S" <Dale.S.Hardy@uth.tmc.edu>
To: statalist@hsphsun2.harvard.edu
Sent: 2/3/2009 10:21 PM
Subject: st: time efficient way to choose variables

I have data in which I want to pick out variables associated with developing a disease. Each time I run the foreach command with the covariates, I take out the one variable with the highest z value and p value < 0.05 and put it in the second equation (stcox). I repeat this until no variables with p value < 0.05 are left when I run the models with the foreach command. Here is an example:

    foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 sizeband pnnumb grade_s lung4 comorbid treat2r xrt3 seer1 dxyear_cate {
        stcox PAC1 `var'
    }

Then I choose the variable with the highest z score with p value < 0.05 and run the model again. Comorbid is taken out because it has the highest z score and is placed in the second equation.
Second run:

    foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 sizeband pnnumb grade_s lung4 treat2r xrt3 seer1 dxyear_cate {
        stcox PAC1 comorbid `var'
    }

Third run: sizeband was chosen because it had the highest z score with p value < 0.05, and it was placed in the second model:

    foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 pnnumb grade_s lung4 treat2r xrt3 seer1 dxyear_cate {
        stcox PAC1 comorbid sizeband `var'
    }

I do this until there are no more variables with p value < 0.05 to choose from.

1. My question is how I can do this process quickly and time-efficiently. Do I use an array? Can you show me how to do this?

2. Is there also a time-efficient process for looking for effect modifiers among several variables (one at a time, in separate models) using the likelihood ratio test?

Thanks.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
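
For readers who want to see how the manual procedure in the question could be automated, here is a minimal sketch using local macros rather than an array. It is an illustration under stated assumptions, not a recommended analysis: the data are assumed to be already stset, PAC1 is assumed to stay in every model, "p value < 0.05" is taken to mean a two-sided Wald test (|z| above invnormal(0.975)), and the macro names candidates, chosen, best, bestz and zcrit are made up for this example. The caution in the reply above still applies: p values obtained after this kind of selection capitalize on chance. Stata's built-in stepwise prefix covers similar ground.

```stata
* Sketch only: automates the manual forward-selection loop from the question.
* Assumes the data are already stset and that PAC1 stays in every model.
* The macro names (candidates, chosen, best, bestz, zcrit) are illustrative.
local candidates agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 sizeband pnnumb grade_s lung4 comorbid treat2r xrt3 seer1 dxyear_cate
local chosen
local zcrit = invnormal(0.975)    // two-sided p < 0.05 on the Wald z
local done 0

while !`done' {
    local best
    local bestz 0
    * fit one model per remaining candidate and track the largest |z|
    foreach var of local candidates {
        quietly stcox PAC1 `chosen' `var'
        local z = abs(_b[`var'] / _se[`var'])
        * skip a missing z (for example, a term omitted for collinearity)
        if `z' < . & `z' > `bestz' {
            local best `var'
            local bestz = `z'
        }
    }
    * stop when nothing is left or the best candidate misses the threshold
    if "`best'" == "" | `bestz' < `zcrit' {
        local done 1
    }
    else {
        local chosen `chosen' `best'
        local candidates : list candidates - best
        display as text "added `best' (|z| = " %5.2f `bestz' ")"
    }
}

* final model with every variable that was selected
stcox PAC1 `chosen'
```

Because local macros vanish when a block finishes executing, this is meant to be run as a whole from a do-file, not typed line by line. Each candidate is entered as a single term, exactly as in the original loops.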

**Follow-Ups**:

- **RE: st: time efficient way to choose variables** *From:* "Lachenbruch, Peter" <Peter.Lachenbruch@oregonstate.edu>
- **RE: st: time efficient way to choose variables** *From:* "Nick Cox" <n.j.cox@durham.ac.uk>
