# Re: st: stepwise

From: Richard Williams
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: stepwise
Date: Mon, 04 Sep 2006 09:20:29 -0500

At 04:30 AM 9/4/2006, Timothy.Mak@iop.kcl.ac.uk wrote:
```stepwise regression is needed. Say we have n = 200, and a potential pool
of predictors = 50, say that each of these 50 predictors has 1 or 2
missing, not necessarily randomly. Using the Stata stepwise procedure, we
may well end up with a final model with some 5 variables, but this model
was derived only using around 75% of the sample, and most likely not a
random sample. Would it not be wiser to use all available observations at
each try? Intuitively I feel that this final model might be less biased
because it does not involve throwing as much information away (1% vs 25%),
although I believe mathematically this would be quite difficult to prove.
```
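For a rough sense of the arithmetic in the quoted question, here is a minimal Python sketch of the share of complete cases that listwise deletion would leave. It rests on two assumptions that are not in the original post: either the missing values never fall on the same observation (the worst case), or they fall independently at random. The function names are illustrative only.

```python
def complete_share_disjoint(n, n_predictors, missing_per_predictor):
    """Worst case: every missing value falls on a different observation."""
    lost = min(n, n_predictors * missing_per_predictor)
    return (n - lost) / n


def complete_share_independent(n, n_predictors, missing_per_predictor):
    """Missing values scattered independently across observations."""
    p_observed = 1 - missing_per_predictor / n
    return p_observed ** n_predictors


n, k = 200, 50
for m in (1, 2):
    print(
        m,
        complete_share_disjoint(n, k, m),
        round(complete_share_independent(n, k, m), 3),
    )
```

With one missing value per predictor, both scenarios leave roughly three quarters of the sample, which is in line with the "around 75%" in the question; with two per predictor the share drops further.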
One of the concerns with stepwise is that a different sample could easily lead to different variables being selected. That concern would seem to be even greater with a small sample, where the estimates are going to be less precise: two different samples of 200 could easily lead to two different sets of variables being selected, especially if many of the variables have nearly equal correlations.

Your suggested procedure might make this even worse. Suppose X1 and X2 both have 20% missing data and it is a different 20% for each variable. X1 barely edges out X2 in step 1. The sample for subsequent steps will then be quite different from what it would be if X2 had barely edged out X1.
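The X1/X2 scenario can be made concrete with a small Python sketch. The missingness pattern (which observations are missing for which variable) and the helper name are hypothetical, chosen only so that each variable lacks a different 20% of n = 200 cases.

```python
n = 200
# Hypothetical pattern: each variable is missing on a different 20% of cases.
missing = {
    "X1": set(range(0, 40)),
    "X2": set(range(40, 80)),
}


def sample_after_selecting(selected, n_obs, missing_by_var):
    """Observations usable once every selected variable must be non-missing."""
    usable = set(range(n_obs))
    for var in selected:
        usable -= missing_by_var[var]
    return usable


after_x1 = sample_after_selecting(["X1"], n, missing)
after_x2 = sample_after_selecting(["X2"], n, missing)
shared = after_x1 & after_x2
print(len(after_x1), len(after_x2), len(shared))  # 160 160 120
```

Under this pattern, a quarter of the observations used in step 2 (40 of 160) depend entirely on which variable happened to win step 1.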

Anyway, I would say that, if you are really concerned about the effects of missing data, then try to do something about it. If you don't want to get too fancy about it, perhaps even simple mean substitution (which has its own problems) would be better than nothing at all. See the -impute- command.
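As an illustration of what simple mean substitution does, here is a minimal Python sketch. It is not a reproduction of Stata's -impute- command; the function name and the toy data are assumptions made for the example.

```python
def mean_substitute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


x = [2.0, None, 4.0, 6.0, None]
print(mean_substitute(x))  # [2.0, 4.0, 4.0, 6.0, 4.0]
```

Every filled-in value sits exactly at the mean, so the variable's variance shrinks; that is one of the problems with this approach.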

I assume it would be possible to write a stepwise procedure that behaved the way you want, but it might take a long time to run. Regular stepwise can be done just by working off a single correlation matrix. With your approach, the correlation matrix would be a moving target as the sample changed, so I imagine you'd have to do a lot more calculations.
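To show how the first steps of a forward selection can be read off a fixed correlation matrix, here is a minimal Python sketch. The correlations are invented for illustration, and ranking candidates by first-order partial correlation is a simplification: an actual stepwise routine applies significance thresholds for entry and removal, which this sketch omits.

```python
import math


def partial_corr(r_yj, r_yk, r_jk):
    """Correlation of y and x_j after removing x_k from both."""
    return (r_yj - r_yk * r_jk) / math.sqrt((1 - r_yk ** 2) * (1 - r_jk ** 2))


# Hypothetical correlations, invented for illustration.
r_y = {"X1": 0.50, "X2": 0.48, "X3": 0.30}  # each predictor with y
r_x = {("X1", "X2"): 0.80, ("X1", "X3"): 0.10, ("X2", "X3"): 0.20}


def r_between(a, b):
    """Look up a predictor-predictor correlation in either key order."""
    return r_x[(a, b)] if (a, b) in r_x else r_x[(b, a)]


# Step 1: enter the predictor most strongly correlated with y.
first = max(r_y, key=lambda v: abs(r_y[v]))

# Step 2: rank the rest by partial correlation with y, given the first entrant.
step2 = {
    v: partial_corr(r_y[v], r_y[first], r_between(v, first))
    for v in r_y
    if v != first
}
second = max(step2, key=lambda v: abs(step2[v]))
print(first, second)
```

Everything here is computed from one matrix, which only makes sense when the sample is fixed. If the usable observations changed at each step, every correlation would have to be recomputed from the raw data before the next candidate could be evaluated.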

```Richard, re-running step-wise on the selected model does not produce the
same results, does it?
```
Hopefully it would, but there is no guarantee. If there are big differences, this may underscore the problems you have with missing data (MD), or the problems with using a stepwise procedure where small differences in variable correlations could produce very different models. For example, in the full sample, X1 might barely edge out X2, but in a sample where cases with MD have been eliminated, X2 might barely edge out X1.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
FAX: (574)288-4373
HOME: (574)289-5227
EMAIL: Richard.A.Williams.5@ND.Edu
WWW (personal): http://www.nd.edu/~rwilliam
WWW (department): http://www.nd.edu/~soc