The lasso and LARS methods are also possible for this purpose. Stata has a LARS ado written by Adrian Mander; it also does the lasso. A recent paper (2004) by Austin and Tu discusses using bootstrapping in conjunction with stepwise regression. The sense of their article is that how often each variable is selected across the bootstrap samples gives a hint at the selection distribution. An interesting variant is to combine this with missing values...

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of jverkuilen
Sent: Wednesday, February 04, 2009 6:13 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: time efficient way to choose variables

As others have noted, this is a variant of the long-discredited stepwise regression. There are better automatic variable selection procedures, developed by the machine learning people, that go under colorful names like bagging and boosting. These all use some kind of cross-validation or bootstrapping to protect against the capitalization on chance that older stepwise procedures are very susceptible to. I don't think they are implemented in Stata, but maybe someone has done so. See, e.g., T Hastie, R Tibshirani, J Friedman. 2000. Elements of statistical learning. Springer.

Model averaging is another approach. This pools predictions from models using weights derived from goodness-of-fit measures, again protecting against capitalization on chance by using bootstrapping of some sort. See, e.g., KA Burnham and D Anderson. 2003. Model selection and multimodel inference, 2nd Ed. Springer.

-----Original Message-----
From: "Hardy, Dale S" <Dale.S.Hardy@uth.tmc.edu>
To: statalist@hsphsun2.harvard.edu
Sent: 2/3/2009 10:21 PM
Subject: st: time efficient way to choose variables

I have data in which I want to pick out variables associated with developing a disease.
Each time I run the foreach command with the covariates, I take out the one variable with the highest z value and a p value <0.05, and I put this variable in the second equation (stcox). I repeat this until no variables with p value <0.05 are left when I run the models with the foreach command. Here is an example:

    foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 sizeband pnnumb grade_s lung4 comorbid treat2r xrt3 seer1 dxyear_cate {
        stcox PAC1 `var'
    }

Then I choose the variable with the highest z score and a p value <0.05, and run the model again. Comorbid is taken out of the list because it has the highest z score, and is placed in the second equation:

    foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 sizeband pnnumb grade_s lung4 treat2r xrt3 seer1 dxyear_cate {
        stcox PAC1 comorbid `var'
    }

Third run: sizeband was chosen because it had the highest z score with p value <0.05. It was placed in the second model:

    foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 pnnumb grade_s lung4 treat2r xrt3 seer1 dxyear_cate {
        stcox PAC1 comorbid sizeband `var'
    }

I do this until there are no more variables with p value <0.05 to choose from.

1. My question is: how can I do this process quickly and in a time-efficient way? Do I use an array? Can you show me how to do this?

2. Is there also a time-efficient process for looking for effect modifiers using several variables (one at a time, in separate models) using the likelihood ratio test?

Thanks.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
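
The manual procedure described in the question can be automated with a while loop over a shrinking list of candidates held in a local macro, so no array is needed. What follows is only a minimal sketch under stated assumptions, not a recommended analysis: it assumes the data are already stset, that every candidate enters as a single term with one coefficient, and that "highest z" means largest absolute z. The macro names (candidates, chosen, best, bestz, done) are made up for illustration. As the replies in this thread point out, forward selection of this kind capitalizes on chance, so the result deserves some form of validation.

```stata
* Sketch: automate the forward-selection loop from the question.
* Assumes -stset- has been run and each candidate is a single term.
local chosen
local candidates agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 sizeband ///
    pnnumb grade_s lung4 comorbid treat2r xrt3 seer1 dxyear_cate

local done 0
while !`done' & "`candidates'" != "" {
    local best
    local bestz 0
    foreach var of local candidates {
        * Fit PAC1 + variables chosen so far + one candidate
        quietly stcox PAC1 `chosen' `var'
        * Wald z and two-sided p value for the candidate's coefficient
        local z = abs(_b[`var'] / _se[`var'])
        local p = 2 * normal(-`z')
        * Track the candidate with the largest |z| among those with p < 0.05
        if `p' < 0.05 & `z' > `bestz' {
            local best  `var'
            local bestz `z'
        }
    }
    if "`best'" == "" {
        * No remaining candidate reaches p < 0.05: stop
        local done 1
    }
    else {
        * Move the winner from the candidate list to the model
        local chosen `chosen' `best'
        local candidates : list candidates - best
        display as text "added `best' (|z| = " %5.2f `bestz' ")"
    }
}

* Final model with PAC1 and everything that was selected
stcox PAC1 `chosen'
```

Stata's built-in stepwise prefix appears to cover much the same ground in one line: something like "stepwise, pe(0.05) lockterm1: stcox PAC1 agegrp racecode1 ..." should perform forward selection while forcing PAC1 to stay in the model. Check help stepwise for the exact behavior in your version.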

