Statalist



Re: st: time efficient way to choose variables


From   Steven Samuels <sjhsamuels@earthlink.net>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: time efficient way to choose variables
Date   Wed, 4 Feb 2009 18:27:03 -0500

A Google search on "austin tu bootstrap stepwise" turned up this:

Austin, P. and Tu, J. (2004). Bootstrap methods for developing predictive models, The American Statistician, 58, 131–137.

-Steve
On Feb 4, 2009, at 3:04 PM, Hardy, Dale S wrote:

Tony,

Can you send me the reference to this paper?

Thanks.

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Lachenbruch,
Peter
Sent: Wednesday, February 04, 2009 1:53 PM
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: time efficient way to choose variables

The lasso and LARS methods are also possible for this purpose.  Stata
has a LARS ado written by Adrian Mander - it also does the lasso.

A 2004 paper by Austin and Tu discusses using bootstrapping in
conjunction with stepwise regression - the sense of their article is
that how often each variable is selected across bootstrap samples gives
a hint at the selection distribution.
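
The idea of counting how often each variable is selected across bootstrap samples can be sketched roughly as follows. This is only an illustration, not the procedure from the Austin and Tu paper: the shipped cancer dataset, the two noise covariates, the 50 replications, and the 0.05 entry threshold are all assumptions chosen so the code runs as is.

```stata
* Sketch only: sysuse cancer stands in for real data (assumption)
sysuse cancer, clear
stset studytime, failure(died)

* Hypothetical candidates: two real covariates plus two pure-noise ones
set seed 2009
generate noise1 = rnormal()
generate noise2 = rnormal()
generate byte drug2 = (drug == 2)
generate byte drug3 = (drug == 3)

local candidates age drug2 drug3 noise1 noise2
local reps 50

* One counter per candidate variable
foreach v of local candidates {
    local n_`v' = 0
}

forvalues b = 1/`reps' {
    preserve
    bsample                      // resample observations with replacement
    local chosen
    capture quietly stepwise, pe(0.05): stcox `candidates'
    if _rc == 0 {
        capture local chosen : colnames e(b)
    }
    * Keep only names that are candidates, then count them
    local chosen : list chosen & candidates
    foreach v of local chosen {
        local n_`v' = `n_`v'' + 1
    }
    restore
}

* Share of bootstrap samples in which each variable was selected
foreach v of local candidates {
    display "`v'" _col(12) %5.2f `n_`v''/`reps'
}
```

Replications in which the selection fails are left in the denominator here, so a failed fit counts as "nothing selected"; that is a design choice, not a requirement.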

An interesting variant is to combine this with missing values...

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of jverkuilen
Sent: Wednesday, February 04, 2009 6:13 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: time efficient way to choose variables

As others have noted, this is a variant of the long-discredited stepwise
regression.

There are better automatic variable selection procedures developed by
the machine learning people that go under colorful names like bagging
and boosting. These all use some kind of cross-validation or
bootstrapping to protect against capitalization on chance that older
stepwise procedures are very susceptible to. I don't think they are
implemented in Stata, but maybe someone has. See, e.g., T Hastie, R
Tibshirani, J Friedman. 2001. The Elements of Statistical Learning.
Springer.

Model averaging is another approach. This pools predictions from models
using weights derived from goodness of fit measures, again protecting
against capitalization on chance by using bootstrapping of some sort.
See, e.g., KP Burnham and DR Anderson. 2002. Model selection and
multimodel inference, 2nd Ed. Springer.
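
As a rough illustration of the weighting step only, the sketch below computes Akaike weights, w_i = exp(-d_i/2) / sum_j exp(-d_j/2) with d_i = AIC_i - min(AIC), which is one goodness-of-fit-based weighting scheme discussed by Burnham and Anderson. It does not do any bootstrapping or pooling of predictions. The cancer dataset and the three candidate models are assumptions made so the code runs.

```stata
* Sketch only: sysuse cancer stands in for real data (assumption)
sysuse cancer, clear
stset studytime, failure(died)
generate byte drug2 = (drug == 2)
generate byte drug3 = (drug == 3)

* Hypothetical candidate models
local m1 age
local m2 drug2 drug3
local m3 age drug2 drug3
local nmodels 3

* AIC = -2*log partial likelihood + 2*number of coefficients
forvalues i = 1/`nmodels' {
    quietly stcox `m`i''
    scalar aic`i' = -2*e(ll) + 2*e(df_m)
}

* Smallest AIC among the candidates
scalar aicmin = aic1
forvalues i = 2/`nmodels' {
    scalar aicmin = min(aicmin, aic`i')
}

* Normalizing constant for the weights
scalar denom = 0
forvalues i = 1/`nmodels' {
    scalar denom = denom + exp(-(aic`i' - aicmin)/2)
}

* Akaike weight of each model
forvalues i = 1/`nmodels' {
    scalar w`i' = exp(-(aic`i' - aicmin)/2)/denom
    display "model `i' (`m`i'')" _col(32) "AIC = " %8.2f aic`i' ///
        "   weight = " %5.3f w`i'
}
```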



-----Original Message-----
From: "Hardy, Dale S" <Dale.S.Hardy@uth.tmc.edu>
To: statalist@hsphsun2.harvard.edu
Sent: 2/3/2009 10:21 PM
Subject: st: time efficient way to choose variables

I have data in which I want to pick out variables associated with
developing a disease. On each pass I run the foreach command over the
remaining covariates, pick the one variable with the highest Z value
among those with p value <0.05, and move it into the stcox model. I
repeat this until no remaining variable has a p value <0.05 when I run
the models with the foreach command.

Here is an example below:

* Pass 1: fit PAC1 plus each candidate covariate in turn
foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 ///
    sizeband pnnumb grade_s lung4 comorbid treat2r xrt3 seer1 dxyear_cate {
    stcox PAC1 `var'
}

Then I choose the variable with the highest Z score with p value <0.05
and run the models again. Comorbid is taken out of the loop because it
has the highest Z score and is placed in the stcox model.

* Pass 2: comorbid is now fixed in the model
foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 ///
    sizeband pnnumb grade_s lung4 treat2r xrt3 seer1 dxyear_cate {
    stcox PAC1 comorbid `var'
}

Third run:
Sizeband was chosen because it had the highest Z score with p value <0.05.
It was added to the stcox model:

* Pass 3: comorbid and sizeband are now fixed in the model
foreach var of varlist agegrp racecode1 s_sex1 ses_pov ajcc6seer6_1 ///
    pnnumb grade_s lung4 treat2r xrt3 seer1 dxyear_cate {
    stcox PAC1 comorbid sizeband `var'
}

I do this until there are no more variables with p value <0.05 to choose
from.
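
The manual passes above amount to forward selection with one term always kept in the model, which Stata's stepwise prefix can run in a single call. The sketch below is only an illustration under assumptions: the shipped cancer dataset and the generated noise covariates stand in for the real data, and age plays the role that PAC1 plays in the post. Note that stepwise by default picks the term with the smallest Wald p-value, which for single-coefficient terms matches picking the largest absolute Z.

```stata
* Sketch only: sysuse cancer stands in for the poster's data (assumption)
sysuse cancer, clear
stset studytime, failure(died)

* Hypothetical candidates, including two pure-noise covariates
set seed 12345
generate noise1 = rnormal()
generate noise2 = rnormal()
generate byte drug2 = (drug == 2)
generate byte drug3 = (drug == 3)

* Forward selection: add terms while the best candidate has p < 0.05.
* lockterm1 forces the first term (age here) to stay in every model.
stepwise, pe(0.05) lockterm1: stcox age drug2 drug3 noise1 noise2

* With the variables from the post, the analogous call would be:
* stepwise, pe(0.05) lockterm1: stcox PAC1 agegrp racecode1 s_sex1 ///
*     ses_pov ajcc6seer6_1 sizeband pnnumb grade_s lung4 comorbid  ///
*     treat2r xrt3 seer1 dxyear_cate
```

Automating the search makes it faster but does not address the concerns about stepwise selection raised in the replies.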

1. My question is how can I do this process quickly and
time-efficiently?
Do I use an array? Can you show me how to do this?

2. Is there also a time-efficient way to look for effect
modifiers using several variables (one at a time in separate models)
using the likelihood ratio test?
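
For question 2, one possible approach is to store the main-effects model once and then loop over candidate modifiers, adding one product term at a time and comparing with lrtest. The sketch below is an illustration under assumptions: the cancer dataset stands in for the real data, age plays the role of the exposure, and the int_ prefix for the product terms is a made-up name.

```stata
* Sketch only: sysuse cancer stands in for real data (assumption)
sysuse cancer, clear
stset studytime, failure(died)
generate byte drug2 = (drug == 2)
generate byte drug3 = (drug == 3)

* Main-effects model, fitted once and stored for comparison
quietly stcox age drug2 drug3
estimates store main

* Add one exposure-by-modifier product term at a time and test it
foreach m of varlist drug2 drug3 {
    generate double int_`m' = age * `m'
    quietly stcox age drug2 drug3 int_`m'
    estimates store with_`m'
    quietly lrtest main with_`m'
    display "`m'" _col(12) "LR chi2(" r(df) ") = " %6.2f r(chi2) ///
        "   p = " %6.4f r(p)
    drop int_`m'
}
```

Each test compares nested models fitted to the same observations, which is what lrtest requires.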


Thanks.



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



