Thanks for the nice review, Stas. But just to put in my two cents'
worth, there is a place for doing all-subsets (not necessarily stepwise)
regressions if you have a purely deterministic response variable and you
are just trying to find a parsimonious approximation to it based on a
linear combination of candidate variables. In this case there is no
"true" model or hypothesis testing issue since there is no probability
model. This situation could arise if the response is expensive to
measure, but the predictor variables are cheap to observe.
Al F.
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Stas
Kolenikov
Sent: Tuesday, September 05, 2006 9:24 AM
To: [email protected]
Subject: Re: st: stepwise
On 9/4/06, [email protected] <[email protected]> wrote:
> Thank you Nick and Richard for their comments. I have also read
> Sribney's comments on the pitfalls of stepwise regression, and I
> confess it's an eye-opener. However I do seem to remember seeing
> arguments for stepwise regression, especially concerning the use of
> too many predictor variables in logistic regression. I don't think
> there's need for a discussion of the place of stepwise regression on
> Statalist now, but I just thought I'd give my final comment. Thanks
> again to all who replied.
The main argument for the stepwise regression as the means of model
selection is apparent ease of implementation once you have the standard
regression toolbox, like -regress- or -logit- or whatever.
Other means of model selection, however, are abundant, but require more
customized programming and a user more informed of the interpretation of
the results.
One popular method is the use of information criteria, like AIC and BIC
available through -ereturn list-. Basically you need to run all 2^p
regressions and choose the one that gives you the smallest value of the
information criterion.
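A minimal sketch of that all-subsets search, written in Python rather than Stata so that it is self-contained. Everything in it is an assumption made for illustration: the data are made up, `solve`, `ols_rss` and `all_subsets_aic` are hypothetical helper names, and AIC is computed for a Gaussian linear model as n*log(RSS/n) + 2k, dropping constants that are the same for every subset.

```python
import itertools
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols_rss(cols, y):
    """Residual sum of squares from OLS of y on a constant plus cols."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

def all_subsets_aic(data, y):
    """Fit all 2^p subsets; return a list of (aic, subset) pairs."""
    n = len(y)
    names = list(data)
    results = []
    for size in range(len(names) + 1):
        for subset in itertools.combinations(names, size):
            rss = ols_rss([data[v] for v in subset], y)
            # size slopes plus one intercept are estimated
            aic = n * math.log(rss / n) + 2 * (size + 1)
            results.append((aic, subset))
    return results

# Made-up data: y depends on x1 only, plus a small disturbance
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "x3": [1, 1, -1, -1, 1, 1, -1, -1],
}
y = [3.1, 4.8, 7.05, 9.15, 10.9, 13.2, 14.85, 16.95]

results = all_subsets_aic(data, y)
best_aic, best_subset = min(results)
print(len(results), "models fitted; smallest AIC for", best_subset)
```

With p candidate variables the loop fits 2^p models, so this brute-force approach is only practical for small p.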
Another big strand of literature on model selection deals with various
shrinkage estimators. Shrinkage typically means that the estimation
procedure intentionally biases the parameters towards zero, as zero
slope would also mean that the parameter is excluded from the
regression. Among the tons of existing methods, ridge regression is
probably the oldest one (it dates back to the 1970s), and was implemented
in Stata by Bill Gould back in Stata 3 or so. The shrinkage is performed by
adding an augmenting matrix to X'X leading to
b_ridge(c) = (X'X + c I)^{-1} X'Y
where c is a positive tuning constant. With c=0, one is back to OLS.
With c = infinity, all estimates are zeroes. Intermediate values of c
give some sort of a curve, and one looks for a spot where the curvature
is the greatest, which means the estimation shifts from something
dictated by the data X'X to something dictated by the dominant cI term
in the brackets.
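The ridge formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up and already centered (so there is no intercept), and `solve` and `ridge` are hypothetical helper names, not anything from Stata.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge(X, y, c):
    """b_ridge(c) = (X'X + c I)^{-1} X'y, with X given as a list of rows."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) + (c if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve(A, Xty)

# Made-up centered data with two correlated predictors
X = [[-2.0, -1.5], [-1.0, -1.0], [0.0, 0.5], [1.0, 0.5], [2.0, 1.5]]
y = [-3.9, -2.1, 0.2, 1.8, 4.0]

# Ridge trace: c = 0 is OLS, large c pushes every slope towards zero
for c in [0.0, 0.1, 1.0, 10.0, 100.0, 1e6]:
    coefs = ridge(X, y, c)
    print(c, [round(v, 4) for v in coefs])
```

Plotting the printed coefficients against c gives the trace curve described above.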
The latest and apparently coolest of the shrinkage methods is called a
lasso, with the idea that you throw a lasso on the set of regression
coefficients:
b_lasso(c) = arg min [Residual sum of squares] s.t. \sum_{k=1}^p |b_k| <= c
that is, the sum of absolute values of the regression slopes is
constrained. When c = 0, all coefficients have to be zero, and that is
essentially your starting point of model selection. As you relax the
constraint, the slopes start popping out one by one, the most important
in their explanatory power going first. When c -> \infty, you of course
get back your old friend OLS. Thus as you plot your coefficients against
c, you would see a trace plot with regression slopes branching off from
zero.
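A rough Python sketch of how the slopes pop out one by one. It rests on assumptions not in the text above: it solves the equivalent penalized problem, min (1/2) RSS + lam * \sum |b_k|, by coordinate descent, so a large penalty lam plays the role of a small bound c; the data are made up and centered; `soft_threshold` and `lasso_cd` are hypothetical names.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize (1/2) RSS + lam * sum|b_k| by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Inner product of x_j with the residual that ignores x_j
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            )
            b[j] = soft_threshold(rho, lam) / sum(X[i][j] ** 2 for i in range(n))
    return b

# Made-up centered data with two correlated predictors
X = [[-2.0, -1.5], [-1.0, -1.0], [0.0, 0.5], [1.0, 0.5], [2.0, 1.5]]
y = [-3.9, -2.1, 0.2, 1.8, 4.0]

# Decreasing the penalty relaxes the constraint: the implied bound
# c = sum|b_k| grows and slopes leave zero one at a time
for lam in [20.0, 10.0, 5.0, 0.5, 0.0]:
    b = lasso_cd(X, y, lam)
    print(lam, [round(v, 4) for v in b], "c =", round(sum(abs(v) for v in b), 4))
```

With the largest penalty both slopes are exactly zero, at intermediate penalties only the first predictor is in the model, and with lam = 0 the loop converges to OLS.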
For the above methods, you get a graphical answer with a (potential)
continuum of models parameterized by the tuning parameter c.
Making a choice between them requires some experience with the model
selection method, or a more formal cross-validation for different values
of c.
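As an illustration of the cross-validation route, here is a small leave-one-out sketch in Python. The assumptions are mine: it uses a coordinate-descent lasso in penalized form (the penalty lam stands in for the tuning parameter, with a large lam corresponding to a small c), a made-up centered data set, an arbitrary grid of penalties, and hypothetical function names.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize (1/2) RSS + lam * sum|b_k| by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            )
            b[j] = soft_threshold(rho, lam) / sum(X[i][j] ** 2 for i in range(n))
    return b

def loo_cv_error(X, y, lam):
    """Mean squared leave-one-out prediction error for a given penalty."""
    n = len(y)
    total = 0.0
    for i in range(n):
        # Refit without observation i, then predict it
        b = lasso_cd(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        pred = sum(xj * bj for xj, bj in zip(X[i], b))
        total += (y[i] - pred) ** 2
    return total / n

# Made-up centered data with two correlated predictors
X = [[-2.0, -1.5], [-1.0, -1.0], [0.0, 0.5], [1.0, 0.5], [2.0, 1.5]]
y = [-3.9, -2.1, 0.2, 1.8, 4.0]

grid = [0.1, 0.5, 1.0, 5.0, 10.0, 20.0]
cv = {lam: loo_cv_error(X, y, lam) for lam in grid}
best_lam = min(cv, key=cv.get)
print("CV error by penalty:", {k: round(v, 4) for k, v in cv.items()})
print("penalty with the smallest CV error:", best_lam)
```

With real data one would use more observations and usually k-fold rather than leave-one-out splits; the structure of the loop stays the same.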
Finally, there is a wealth of Bayesian model selection procedures based
on Markov chain Monte Carlo methods. For a basic MCMC approach, you
sample parameters from a posterior distribution (or something you claim
is a posterior distribution once the chain has converged) to get your
parameter estimates and conduct inference. This can be augmented in two
ways: by allowing more weird priors, or by allowing various models to be
visited by the chain. To allow for an identical zero value of a
regression parameter, one can specify a prior with a point mass at zero,
plus some diffuse continuous distribution around it. For "important"
parameters, you would see that the fraction of the zero point mass goes
down; for "unimportant" parameters, it should go up.
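One common way to write such a prior, given here only as a sketch (the mixing weight w_k and the slab variance \tau^2 are notation assumed for illustration, and the normal slab is just one possible choice of the diffuse part), is:

```latex
\pi(b_k) = w_k \, \delta_0(b_k) + (1 - w_k) \, N(b_k \mid 0, \tau^2),
\qquad 0 \le w_k \le 1
```

where \delta_0 is a point mass at zero. The posterior counterpart of w_k, estimated as the fraction of MCMC draws with b_k = 0, is the quantity that goes down for "important" parameters and up for "unimportant" ones.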
With somewhat more advanced Metropolis-Hastings algorithms, you can also
add steps to your Markov chain that allow transition between models,
like adding or taking away regressors, or changing the structural form
to something non-linear, or to allow correct specification of the
nonlinear terms (like inclusion of all the lower order polynomials: if
you have something like x^l in the model, you should also have all x^k,
k<=l; a naive stepwise procedure may have some of the lower order terms
dropped out). The special concern is to make all those transitions
between models invertible, so that the convergence properties of a
Markov chain are not affected. Then you perform model selection by
looking at the models that the chain (or better a set of parallel
chains) had visited most often, or by computing the fraction of times a
certain parameter is not identically zero, etc.
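A toy Python sketch of a chain that moves between models. It is a simplification under loud assumptions, not a full Bayesian treatment: exp(-BIC/2) is used as a rough stand-in for each model's marginal likelihood, all models get equal prior weight, the only move is flipping one regressor in or out (a symmetric proposal that is its own inverse), and the data and the names `bic` and `model_sampler` are made up for illustration.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def bic(data, y, included):
    """BIC (up to a constant) of OLS of y on a constant plus included columns."""
    n = len(y)
    X = [[1.0] + [data[v][i] for v in included] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(XtX, Xty)
    rss = sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
              for i in range(n))
    return n * math.log(rss / n) + p * math.log(n)

def model_sampler(data, y, n_steps=2000, seed=1):
    """Metropolis chain over subsets; returns inclusion frequencies."""
    rng = random.Random(seed)
    names = list(data)
    current = set()
    current_bic = bic(data, y, sorted(current))
    counts = dict.fromkeys(names, 0)
    for _ in range(n_steps):
        v = rng.choice(names)
        proposal = current ^ {v}  # add v if absent, drop it if present
        prop_bic = bic(data, y, sorted(proposal))
        # Accept with probability min(1, exp(-(BIC_new - BIC_old) / 2))
        if rng.random() < math.exp(min(0.0, -(prop_bic - current_bic) / 2)):
            current, current_bic = proposal, prop_bic
        for name in current:
            counts[name] += 1
    return {name: counts[name] / n_steps for name in names}

# Made-up data: y depends on x1 only, plus a small disturbance
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "x3": [1, 1, -1, -1, 1, 1, -1, -1],
}
y = [3.1, 4.8, 7.05, 9.15, 10.9, 13.2, 14.85, 16.95]

freq = model_sampler(data, y)
print("fraction of steps each regressor was in the model:", freq)
```

The printed fractions are the "fraction of times a certain parameter is not identically zero" mentioned above; in this toy example x1 should be included almost always and the other two much less often.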
Harrell (Regression Modeling Strategies) is a nice book that deals a lot
with the multiple testing issue, of which model selection is one of the
most important examples. Hastie, Tibshirani and Friedman (The Elements of
Statistical Learning) is my other favorite on the topic, although it is
written from a totally different perspective of statistical learning (of
which again model selection is a substantial chunk).
--
Stas Kolenikov
http://stas.kolenikov.name
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/