# Re: st: stepwise

From: Richard Williams
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: stepwise
Date: Fri, 01 Sep 2006 15:54:49 -0500

At 10:15 AM 9/1/2006, Timothy.Mak@iop.kcl.ac.uk wrote:

```
I don't see why you can't work with all available data at each try.
Arguably there is the down side that you're comparing models with
different number of observations. But it just bothers me that at the end
of the day I have a final model that doesn't have the same results as when
I simply enter the variables. Moreover, if there are lots of variables, we
may end up running the procedure on only half of the data, which is a bit
stupid I think. Alan, thanks for the suggestion of multiple imputation,
but that's not my concern at the moment, because I won't be using it
simply because it's too complicated. In any case, how do you run stepwise
regression on several different imputed datasets and decide on one final
one at the end?
```
As to why it happens - I believe both stepwise and nestreg do listwise deletion based on all the vars you have specified. Hence, a variable with missing data that doesn't make the final cut can still cause your sample size to be reduced.
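
To make the mechanism concrete, here is a minimal Python sketch of listwise deletion over a candidate list. It is not Stata and makes no claim about how stepwise or nestreg are implemented internally. The toy data, the variable names, and the `complete_cases` helper are all hypothetical.

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"y": 3.1, "x1": 1.0, "x2": 0.5},
    {"y": 2.7, "x1": 2.0, "x2": None},
    {"y": 4.0, "x1": 3.0, "x2": 1.5},
    {"y": 5.2, "x1": 4.0, "x2": None},
    {"y": 4.8, "x1": 5.0, "x2": 2.0},
    {"y": 6.1, "x1": 6.0, "x2": None},
]


def complete_cases(data, variables):
    """Keep only rows with no missing value on any listed variable."""
    return [r for r in data if all(r[v] is not None for v in variables)]


# Every variable offered to the selection procedure, kept or not.
candidates = ["y", "x1", "x2"]
# Assume, for illustration, that only x1 survives selection.
selected = ["y", "x1"]

n_stepwise = len(complete_cases(rows, candidates))
n_refit = len(complete_cases(rows, selected))

print(n_stepwise, n_refit)  # 3 6
```

Here x2 never enters the final model, yet its missing values still cut the working sample from 6 cases to 3.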

I don't know this for a fact, but it wouldn't surprise me if listwise deletion is a universal or near-universal practice for stepwise programs. I see your point, but how exactly would the alternative work? Consider some practical problems:

* Suppose X1 has 100% data but X2 has 50% missing. If you are going by p-values, X1 has a built-in advantage just because its N is larger. This can also be true of all subsequent variables examined; a larger N gives one variable an advantage over others.

* The formulas used are (or can be) based on comparisons of model fit, e.g. how much does the addition of one or more variables cause the residual sums of squares to go down? Such comparisons should be based on the same sample being used. If the sample size is suddenly cut in half, you'll see big differences in the residual sums of squares, but that is primarily because the sample is much smaller once you add the new variable.

* Suppose data are missing non-randomly. To take an extreme case, suppose X2 has so much missing data because it was only asked of females. X1 might enter first in the full sample, but not enter first (or even at all) in the sample with women only. This point especially concerns me: while you may be able to come up with alternative formulas that get around the first two problems, this problem would seem to be there regardless.

* In short, the whole purpose of stepwise is to select variables based on comparisons of effects. But, it is hard to make and justify comparisons of variable effects if the sample analyzed is a moving target as you go from one variable to the next.
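
The second point above can be illustrated with a small Python sketch. It uses made-up outcome values and an intercept-only model, so nothing about the model changes between the two fits; only the sample does. The function name `rss_mean_only` and the data are assumptions for illustration.

```python
def rss_mean_only(ys):
    """Residual sum of squares around the sample mean (intercept-only model)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


# Hypothetical outcome values for 8 cases.
y_full = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# Suppose a new candidate variable is observed for only every other case,
# so a fit that includes it is restricted to half of the sample.
y_half = y_full[::2]

rss_full = rss_mean_only(y_full)
rss_half = rss_mean_only(y_half)

print(rss_full, rss_half)                              # 42.0 20.0
print(rss_full / len(y_full), rss_half / len(y_half))  # 5.25 5.0
```

The residual sum of squares falls by more than half although no predictor was added, while the per-case figure barely moves. A fit comparison across the two samples would credit the new variable with a drop that is mostly due to the smaller N.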

There would also seem to be workarounds for your concern. If X1 and X2 were the lucky survivors from stepwise, then just run a new regression using X1 and X2; or perhaps run a new stepwise using only those variables. If all is well, you presumably will find that the coefficients are about the same as they were with the original stepwise but they are more significant because of the higher N. If the coefficients are quite a bit different, then that may suggest systematic biases in your missing data.
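
A rough sketch of that check in Python, simplified to the case where only X1 survives and a discarded X2 is what restricted the stepwise sample. The data are invented, `ols_slope` is a hypothetical helper for a one-predictor regression, and the size of gap that counts as "quite a bit different" is left to judgment.

```python
def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: y is roughly 2*x1 + 1, and x2 is missing for half the cases.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
y = [2.0 * x + 1.0 + e for x, e in zip(x1, noise)]
x2 = [0.5, None, 1.5, None, 0.7, None, 2.0, None]

# Sample the stepwise run would have used: cases complete on x1 AND x2.
keep = [i for i, v in enumerate(x2) if v is not None]
b_stepwise = ols_slope([x1[i] for i in keep], [y[i] for i in keep])

# Refit on the survivor alone, so every case with x1 observed is used.
b_refit = ols_slope(x1, y)

print(len(keep), len(x1))
print(b_stepwise, b_refit)
```

In this toy case the two slopes are close, which is the "all is well" outcome; a large gap would be the signal to look harder at why the data are missing.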

Incidentally, I am ignoring for now all the concerns that can be raised about stepwise regression! But the alternative you have in mind, I think, would simply add to those concerns. If anyone has contrary views or knows of stepwise routines that behave like Timothy would like, I'd be interested in hearing about them.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
FAX: (574)288-4373
HOME: (574)289-5227
EMAIL: Richard.A.Williams.5@ND.Edu
WWW (personal): http://www.nd.edu/~rwilliam
WWW (department): http://www.nd.edu/~soc