# Re: st: stepwise

From: "Stas Kolenikov"
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: stepwise
Date: Tue, 5 Sep 2006 09:23:38 -0500

On 9/4/06, Timothy.Mak@iop.kcl.ac.uk <Timothy.Mak@iop.kcl.ac.uk> wrote:

Thank you Nick and Richard for their comments. I have also read Sribney's
comments on the pitfalls of stepwise regression, and I confess it's an
eye-opener. However I do seem to remember seeing arguments for stepwise
regression, especially concerning the use of too many predictor variables
in logistic regression. I don't think there's need for a discussion of the
place of stepwise regression on Statalist now, but I just thought I'd give
my final comment. Thanks again to all who replied.

The main argument for stepwise regression as a means of model
selection is its apparent ease of implementation once you have the
standard regression toolbox, like -regress- or -logit- or whatever.
Other means of model selection are abundant, however, although they
require more interpretation of the results.

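One popular method is the use of information criteria, like AIC and
BIC, which -estat ic- reports after estimation (or which you can
compute from the log likelihood shown by -ereturn list-). Basically
you need to run all 2^p regressions, one for each subset of the p
candidate predictors, and choose the one that gives you the smallest
value of the information criterion.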
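
To make the "run all 2^p regressions" idea concrete, here is a minimal sketch in plain Python (not Stata). Everything in it is an assumption made for illustration: the toy data, the helper names solve, ols_rss and best_subset_aic, and the Gaussian-likelihood form AIC = n*log(RSS/n) + 2k, which drops an additive constant that is the same for every model.

```python
from itertools import combinations
from math import log

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols_rss(X, y, cols):
    """Residual sum of squares from OLS of y on an intercept plus the chosen columns."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    q = len(Z[0])
    ZtZ = [[sum(z[r] * z[s] for z in Z) for s in range(q)] for r in range(q)]
    Zty = [sum(z[r] * yi for z, yi in zip(Z, y)) for r in range(q)]
    beta = solve(ZtZ, Zty)
    return sum((yi - sum(bk * zk for bk, zk in zip(beta, z))) ** 2
               for z, yi in zip(Z, y))

def best_subset_aic(X, y):
    """Fit every subset of predictors and return (best AIC, best columns, models fitted)."""
    n, p = len(y), len(X[0])
    best_aic, best_cols, n_models = None, None, 0
    for size in range(p + 1):
        for cols in combinations(range(p), size):
            n_models += 1
            # Gaussian AIC up to a constant; k counts the slopes plus the intercept
            aic = n * log(ols_rss(X, y, cols) / n) + 2 * (len(cols) + 1)
            if best_aic is None or aic < best_aic:
                best_aic, best_cols = aic, cols
    return best_aic, best_cols, n_models

# Hypothetical toy data: y depends on the first predictor only, plus small noise
X = [[float(i), float((2 * i) % 5), float((3 * i) % 4)] for i in range(12)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, 0.2, -0.3]
y = [1.0 + 2.0 * row[0] + e for row, e in zip(X, noise)]

aic, cols, n_models = best_subset_aic(X, y)
print("models fitted:", n_models, "selected columns:", cols)
```

The loop grows as 2^p, so this brute-force search is only practical for a modest number of candidate predictors.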

Another big strand of literature on model selection deals with various
shrinkage estimators. Shrinkage typically means that the estimation
procedure intentionally biases the parameters towards zero, as zero
slope would also mean that the parameter is excluded from the
regression. Among the tons of existing methods, ridge regression is
probably the oldest one (it dates back to the 1970s), and was implemented
in Stata by Bill Gould back in Stata 3 or so. The shrinkage is performed as

b_ridge(c) = (X'X + c I)^{-1} X'Y

where c is a nonnegative tuning constant. With c=0, one is back to OLS.
As c goes to infinity, all estimates shrink to zero. Plotting the
estimates against intermediate values of c gives some sort of a curve
(the ridge trace), and one looks for a spot where the curvature is the
greatest, which means the estimation shifts from something dictated by
the data X'X to something dictated by the dominant cI term in the
brackets.
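
As a rough illustration of the ridge formula b_ridge(c) = (X'X + c I)^{-1} X'Y, here is a small plain-Python sketch. The helper names (solve, ridge) and the toy data are hypothetical. There is no intercept, so the sketch assumes X and Y are already centered.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge(X, y, c):
    """Ridge estimate (X'X + c I)^{-1} X'y; c = 0 reproduces OLS."""
    p = len(X[0])
    XtX = [[sum(row[r] * row[s] for row in X) for s in range(p)] for r in range(p)]
    Xty = [sum(row[r] * yi for row, yi in zip(X, y)) for r in range(p)]
    for r in range(p):
        XtX[r][r] += c  # add the c I term to the diagonal
    return solve(XtX, Xty)

# Hypothetical toy data with orthogonal columns, so X'X = 4 I and X'y = (8, 4)
X = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
y = [3.0, 1.0, 3.0, 1.0]

# Ridge trace: the estimates shrink towards zero as c grows
for c in [0.0, 1.0, 4.0, 16.0, 100.0]:
    print(c, ridge(X, y, c))
```

With this particular design each coefficient is just its OLS value multiplied by 4/(4 + c), which makes the shrinkage easy to check by hand.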

The latest and apparently coolest of the shrinkage methods is called the
lasso, with the idea that you throw a lasso on the set of regression
coefficients:

b_lasso(c) = arg min[Residual sum of squares] s.t. \sum_{k=1}^p | b_k| <= c

that is, the sum of absolute values of the regression slopes is
constrained. When c = 0, all coefficients have to be zero, and that is
essentially your starting point of model selection. As you relax the
constraint, the slopes start popping out one by one, the most
important in their explanatory power going first. When c -> \infty,
you of course get back your old friend OLS. Thus as you plot your
coefficients against c, you would see a trace plot with regression
slopes branching off from zero.
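
A small sketch of this behavior in plain Python follows. Note the swap: rather than the constrained form with bound c, it solves the equivalent penalized form, min (1/2) RSS + lam * sum |b_k|, by coordinate descent, so a large lam corresponds to a small c and vice versa. The function names and toy data are hypothetical, and there is no intercept (centered data assumed).

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, and set it to exactly zero if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent for min 0.5 * RSS + lam * sum |b_k| (penalized lasso)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # inner product of x_j with the residual that leaves predictor j out
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            )
            norm2 = sum(X[i][j] ** 2 for i in range(n))
            b[j] = soft_threshold(rho, lam) / norm2
    return b

# Hypothetical toy data with orthogonal columns; the OLS solution is (2, 1)
X = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
y = [3.0, 1.0, 3.0, 1.0]

# Relaxing the penalty (tight constraint -> loose constraint):
# both slopes start at zero, the stronger one enters first, and OLS is the limit
for lam in [8.0, 6.0, 2.0, 0.0]:
    print(lam, lasso_cd(X, y, lam))
```

Plotting the coefficients against lam (or against the implied sum of absolute values) gives the trace plot described above.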

For the above methods, you get a graphical answer with a (potential)
continuum of models parameterized by the tuning parameter c.
Making a choice between them requires some experience with the model
selection method, or a more formal cross-validation for different
values of c.
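
One way the cross-validation could look, sketched in plain Python for the ridge case: split the data into k folds, fit on k-1 of them for each candidate c, and keep the c with the smallest out-of-fold squared prediction error. The grid of c values, the fold assignment (observation i goes to fold i mod k), the toy data and all function names are assumptions for illustration only.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge(X, y, c):
    """Ridge estimate (X'X + c I)^{-1} X'y, no intercept (centered data assumed)."""
    p = len(X[0])
    XtX = [[sum(row[r] * row[s] for row in X) for s in range(p)] for r in range(p)]
    Xty = [sum(row[r] * yi for row, yi in zip(X, y)) for r in range(p)]
    for r in range(p):
        XtX[r][r] += c
    return solve(XtX, Xty)

def cv_ridge(X, y, grid, k=4):
    """Return (best c, {c: mean squared out-of-fold prediction error})."""
    n = len(y)
    scores = {}
    for c in grid:
        err = 0.0
        for fold in range(k):
            train = [i for i in range(n) if i % k != fold]
            test = [i for i in range(n) if i % k == fold]
            b = ridge([X[i] for i in train], [y[i] for i in train], c)
            err += sum((y[i] - sum(bj * xj for bj, xj in zip(b, X[i]))) ** 2
                       for i in test)
        scores[c] = err / n
    return min(scores, key=scores.get), scores

# Hypothetical toy data
X = [[1.0, 0.5], [2.0, -1.0], [3.0, 1.5], [4.0, 0.0],
     [5.0, -0.5], [6.0, 1.0], [7.0, -1.5], [8.0, 0.5]]
noise = [0.2, -0.3, 0.1, 0.4, -0.2, 0.0, 0.3, -0.1]
y = [2.0 * r[0] + r[1] + e for r, e in zip(X, noise)]

grid = [0.0, 0.1, 1.0, 10.0]
best_c, scores = cv_ridge(X, y, grid)
print("chosen c:", best_c)
print(scores)
```

The same loop would work for the lasso by swapping in a lasso fit for the ridge fit. In practice one would want more folds, a finer grid, and randomly assigned folds.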

Finally, there is a wealth of Bayesian model selection procedures
based on Markov chain Monte Carlo methods. For a basic MCMC approach,
you sample parameters from a posterior distribution (or something you
claim is a posterior distribution once the chain has converged) to get
your parameter estimates and conduct inference. This can be augmented
in two ways: by allowing more weird priors, or by allowing various
models to be visited by the chain. To allow for an identical zero
value of a regression parameter, one can specify a prior with a point
mass at zero, plus some diffuse continuous distribution around it. For
"important" parameters, you would see the posterior weight on the zero
point mass go down; for "unimportant" parameters, it should go up.
With somewhat more advanced Metropolis-Hastings algorithms, you can
move between models, like adding or taking away regressors, or changing the
structural form to something non-linear, or to allow correct
specification of the nonlinear terms (like inclusion of all the lower
order polynomials: if you have something like x^l in the model, you
should also have all x^k, k<=l; a naive stepwise procedure may have
some of the lower order terms dropped out). The special concern is to
make all those transitions between models reversible, so that the
convergence properties of the Markov chain are not affected. Then you
perform model selection by looking at the models that the chain (or
better, a set of parallel chains) has visited most often, or by
computing the fraction of times a certain parameter is not identically
zero, etc.
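
To give a flavor of a chain that visits different models, here is a deliberately simplified plain-Python sketch. It is not the point-mass-prior sampler described above. It is a Metropolis walk over inclusion indicators that flips one regressor in or out at a time (a symmetric, hence reversible, proposal) and uses exp(-BIC/2) as a crude stand-in for each model's marginal likelihood, with equal prior weight on all models. The data, function names, chain length and burn-in are all hypothetical.

```python
import random
from math import exp, log

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def bic(X, y, model):
    """Gaussian BIC (up to a constant) of OLS on an intercept plus included columns."""
    n = len(y)
    Z = [[1.0] + [row[j] for j, inc in enumerate(model) if inc] for row in X]
    q = len(Z[0])
    ZtZ = [[sum(z[r] * z[s] for z in Z) for s in range(q)] for r in range(q)]
    Zty = [sum(z[r] * yi for z, yi in zip(Z, y)) for r in range(q)]
    beta = solve(ZtZ, Zty)
    rss = sum((yi - sum(bk * zk for bk, zk in zip(beta, z))) ** 2
              for z, yi in zip(Z, y))
    return n * log(rss / n) + q * log(n)

def inclusion_frequencies(X, y, iters=3000, burn=500, seed=1):
    """Fraction of post-burn-in draws in which each regressor is in the model."""
    rng = random.Random(seed)
    p = len(X[0])
    model = [False] * p          # start from the intercept-only model
    cur = bic(X, y, model)
    counts = [0] * p
    for t in range(iters):
        prop = model[:]
        j = rng.randrange(p)     # pick one regressor and flip it in or out
        prop[j] = not prop[j]
        new = bic(X, y, prop)
        # Metropolis acceptance with exp(-BIC/2) as approximate model weight
        if rng.random() < exp(min(0.0, -(new - cur) / 2.0)):
            model, cur = prop, new
        if t >= burn:
            for k in range(p):
                counts[k] += model[k]
    return [c / (iters - burn) for c in counts]

# Hypothetical toy data: y depends on the first predictor only, plus small noise
X = [[float(i), float((2 * i) % 5), float((3 * i) % 4)] for i in range(12)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, 0.2, -0.3]
y = [1.0 + 2.0 * row[0] + e for row, e in zip(X, noise)]

freq = inclusion_frequencies(X, y)
print("inclusion frequencies:", freq)
```

Regressors with inclusion frequency near one are the ones the chain almost never drops. A serious implementation would use proper priors and several parallel chains with convergence checks.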

Harrell's Regression Modeling Strategies is a nice book that deals a
lot with the multiple testing issue, of which model selection is one
of the most important examples. The Elements of Statistical Learning
by Hastie, Tibshirani and Friedman is my other favorite on the topic,
although it is written from a totally different perspective of
statistical learning (of which again model selection is a substantial
chunk).

--
Stas Kolenikov
http://stas.kolenikov.name

