Statalist



RE: st: time efficient way to choose variables


From   jverkuilen <jverkuilen@gc.cuny.edu>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: time efficient way to choose variables
Date   Wed, 4 Feb 2009 09:12:42 -0500

As others have noted, this is a variant of the long-discredited stepwise regression.
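To make the equivalence concrete: the manual loop described in the original message appears to correspond roughly to forward selection with Stata's -stepwise- prefix. The following is only a sketch, assuming the data have already been -stset- and that PAC1 is meant to stay in every model; it uses the variable names from the original message and is shown for comparison rather than as a recommendation, given the problems with stepwise methods.

```stata
* Sketch only: assumes the data have already been -stset-.
* pe(0.05) alone requests forward selection: at each step the most
* significant excluded term is added if its p-value is below 0.05.
* lockterm1 forces the first term (PAC1) to be kept in every model.
* The /// line continuations work in a do-file, not at the command line.
stepwise, pe(0.05) lockterm1: stcox PAC1 agegrp racecode1 s_sex1 ses_pov ///
    ajcc6seer6_1 sizeband pnnumb grade_s lung4 comorbid treat2r xrt3    ///
    seer1 dxyear_cate
```

By default -stepwise- bases entry on Wald tests, which matches picking the largest Z in the manual loop; adding the lr option would switch to likelihood-ratio tests.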

There are better automatic variable selection procedures, developed by the machine learning people, that go under colorful names like bagging and boosting. These all use some kind of cross-validation or bootstrapping to protect against capitalization on chance, to which older stepwise procedures are very susceptible. I don't think they are implemented in Stata, but maybe someone has written them. See, e.g., T Hastie, R Tibshirani, J Friedman. 2001. The Elements of Statistical Learning. Springer.

Model averaging is another approach. It pools predictions from several models using weights derived from goodness-of-fit measures, again protecting against capitalization on chance by using bootstrapping of some sort. See, e.g., KP Burnham and DR Anderson. 2002. Model Selection and Multimodel Inference, 2nd ed. Springer.
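As an illustration of such weights, one common choice in the information-theoretic approach of the book cited above is the Akaike weight. The notation here is introduced for illustration only: R candidate models, AIC_i for the AIC of model i, and theta-hat_i for the prediction (or parameter estimate) from model i.

```latex
\Delta_i = \mathrm{AIC}_i - \min_{r} \mathrm{AIC}_r, \qquad
w_i = \frac{\exp(-\Delta_i/2)}{\sum_{r=1}^{R} \exp(-\Delta_r/2)}, \qquad
\hat{\theta}_{\mathrm{avg}} = \sum_{i=1}^{R} w_i \, \hat{\theta}_i
```

The weights sum to one, so models that fit nearly as well as the best one still contribute to the pooled prediction instead of being discarded.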



-----Original Message-----
From: "Hardy, Dale S" <Dale.S.Hardy@uth.tmc.edu>
To: statalist@hsphsun2.harvard.edu
Sent: 2/3/2009 10:21 PM
Subject: st: time efficient way to choose variables

I have data in which I want to pick out variables associated with
developing a disease. Each time I run the foreach command over the
covariates, I take out the one variable with the highest Z value among
those with p value <0.05 and add it to the next stcox model. I repeat
this until no variables with p value <0.05 are left when I run the
models with the foreach command.

Here is an example:

foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 ///
    sizeband pnnumb grade_s lung4 comorbid treat2r xrt3 seer1 dxyear_cate {
    stcox PAC1 `var'
}

I then choose the variable with the highest Z score among those with
p value <0.05 and run the models again. Here comorbid had the highest Z
score, so it is taken out of the loop list and added to the stcox model.

foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 ///
    sizeband pnnumb grade_s lung4 treat2r xrt3 seer1 dxyear_cate {
    stcox PAC1 comorbid `var'
}

Third run:
Sizeband was chosen because it had the highest Z score with p value <0.05.
It was added to the model:

foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 ///
    pnnumb grade_s lung4 treat2r xrt3 seer1 dxyear_cate {
    stcox PAC1 comorbid sizeband `var'
}

I do this until there are no more variables with p value <0.05 to choose
from.

1. My question is: how can I do this process quickly and in a
time-efficient way?
Do I use an array? Can you show me how to do this?

2. Is there also a time-efficient way to look for effect
modifiers among several variables (one at a time, in separate models)
using the likelihood ratio test?


Thanks.



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/




