Re: st: stepwise
At 10:15 AM 9/1/2006, Timothy.Mak@iop.kcl.ac.uk wrote:
> As to why it happens - I believe both stepwise and nestreg do
> listwise deletion based on all the vars you have specified. Hence, a
> variable with missing data that doesn't make the final cut can still
> cause your sample size to be reduced.
> I don't see why you can't work with all available data at each try.
> Arguably there is the down side that you're comparing models with
> different number of observations. But it just bothers me that at the end
> of the day I have a final model that doesn't have the same results as when
> I simply enter the variables. Moreover, if there are lots of variables, we
> may end up running the procedure on only half of the data, which is a bit
> stupid I think. Alan, thanks for the suggestion of multiple imputation,
> but that's not my concern at the moment, because I won't be using it
> simply because it's too complicated. In any case, how do you run stepwise
> regression on several different imputed datasets and decide on one final
> one at the end?
I don't know this for a fact, but it wouldn't surprise me if listwise
deletion is a universal or near-universal practice for stepwise
programs. I see your point, but how exactly would the alternative
work? Consider some practical problems:
* Suppose X1 has 100% data but X2 has 50% missing. If you are going
by p values, X1 has a built in advantage just because its N is
larger. This can also be true of all subsequent variables examined;
a larger N gives one variable an advantage over others.
* The formulas used are (or can be) based on comparisons of model
fit, e.g. how much does the addition of one or more variables cause
the residual sums of squares to go down? Such comparisons should be
based on the same sample being used. If the sample size is suddenly
cut in half, you'll see big differences in the residual sums of
squares, but that is primarily because the sample is much smaller
once you add the new variable.
* Suppose data are missing non-randomly. To take an extreme case,
suppose X2 has so much missing data because it was only asked of
females. X1 might enter first in the full sample, but not enter
first (or even at all) in the sample with women only. This point
especially concerns me - while you may be able to come up with
alternative formulas that get around the first 2 problems, this
problem would seem to be there regardless.
* In short, the whole purpose of stepwise is to select variables
based on comparisons of effects. But, it is hard to make and justify
comparisons of variable effects if the sample analyzed is a moving
target as you go from one variable to the next.
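To make the first two points concrete, here is a small sketch in Python (not Stata, and not what any stepwise routine actually runs). The data values, the correlation of 0.3, and the helper names are all made up for illustration. It shows that the residual sum of squares falls when half the cases are dropped even though no predictor has been added, and that the same correlation yields a larger t statistic when N is larger.

```python
import math

# Illustrative only: made-up numbers, not taken from the thread.

def rss_intercept_only(y):
    """Residual sum of squares for a model containing only a constant."""
    mean_y = sum(y) / len(y)
    return sum((v - mean_y) ** 2 for v in y)

def t_for_correlation(r, n):
    """t statistic for a simple correlation r based on n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

y = [2.0, 4.0, 6.0, 8.0, 3.0, 5.0, 7.0, 9.0]
# Hypothetical X2: None marks a missing value, so half the cases are missing.
x2 = [1.0, None, 0.0, None, 1.0, None, 0.0, None]

# RSS in the full sample versus the sample where X2 is observed.
# X2 is not in either model; only the set of cases changes.
full_rss = rss_intercept_only(y)
sub_rss = rss_intercept_only([yi for yi, xi in zip(y, x2) if xi is not None])

# Same correlation, different N: the larger sample gives the larger t.
t_full = t_for_correlation(0.3, 100)
t_half = t_for_correlation(0.3, 50)

print("RSS, all cases:", full_rss)
print("RSS, cases with X2 observed:", sub_rss)
print("t with N=100:", round(t_full, 3), " t with N=50:", round(t_half, 3))
```

The drop in RSS here says nothing about X2's explanatory power, which is why fit comparisons need a fixed sample.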
There would also seem to be workarounds for your concern. If X1 and
X2 were the lucky survivors from stepwise, then just run a new
regression using X1 and X2; or perhaps run a new stepwise using only
those variables. If all is well, you presumably will find that the
coefficients are about the same as they were with the original
stepwise but they are more significant because of the higher N. If
coefficients are quite a bit different, then that may suggest
systematic biases in your missing data.
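The check described above might look something like the following sketch, again in Python with made-up data rather than Stata. The variable names y, x1 and x3 and both helpers are hypothetical; x3 stands in for a candidate with missing values that stepwise did not keep. The sketch compares the sample size and the x1 slope in the listwise-deleted sample with those from a refit that uses every case where y and x1 are observed.

```python
# Illustrative only: made-up data; x3 is a candidate that did not survive.
rows = [
    {"y": 3.0,  "x1": 1.0, "x3": 0.2},
    {"y": 5.0,  "x1": 2.0, "x3": None},
    {"y": 8.0,  "x1": 3.0, "x3": 0.7},
    {"y": 9.0,  "x1": 4.0, "x3": None},
    {"y": 11.0, "x1": 5.0, "x3": 0.1},
    {"y": 14.0, "x1": 6.0, "x3": None},
]

def complete_cases(data, names):
    """Keep only rows with no missing value on any of the named variables."""
    return [r for r in data if all(r[n] is not None for n in names)]

def slope(data, xname, yname):
    """OLS slope from a simple regression of yname on xname."""
    xs = [r[xname] for r in data]
    ys = [r[yname] for r in data]
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

# Sample a listwise-deleting stepwise would use: complete on all candidates.
stepwise_sample = complete_cases(rows, ["y", "x1", "x3"])
# Sample for the follow-up regression: complete on the survivors only.
refit_sample = complete_cases(rows, ["y", "x1"])

n_stepwise = len(stepwise_sample)
n_refit = len(refit_sample)
slope_stepwise = slope(stepwise_sample, "x1", "y")
slope_refit = slope(refit_sample, "x1", "y")

print("N:", n_stepwise, "vs", n_refit)
print("x1 slope:", round(slope_stepwise, 3), "vs", round(slope_refit, 3))
```

If the two slopes are close, the refit mainly buys precision from the larger N; a large gap would be a reason to look harder at how the data went missing.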
Incidentally, I am ignoring for now all the concerns that can be
raised about stepwise regression! But the alternative you have in
mind, I think, would simply add to those concerns. If anyone has
contrary views or knows of stepwise routines that behave like Timothy
would like, I'd be interested in hearing about them.
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
WWW (personal): http://www.nd.edu/~rwilliam
WWW (department): http://www.nd.edu/~soc