Re: st: stepwise
At 04:30 AM 9/4/2006, [email protected] wrote:
stepwise regression is needed. Say we have n = 200 and a potential pool
of 50 predictors, and say that each of these 50 predictors has 1 or 2
missing values, not necessarily at random. Using the Stata stepwise procedure, we
may well end up with a final model with some 5 variables, but this model
was derived only using around 75% of the sample, and most likely not a
random sample. Would it not be wiser to use all available observations at
each try? Intuitively I feel that this final model might be less biased
because it does not involve throwing as much information away (1% vs 25%),
although I believe mathematically this would be quite difficult to prove.
One of the concerns with stepwise is that a different sample could 
easily lead to different variables being selected.  That concern 
would seem to be even greater with a small sample, where the 
estimates are going to be less precise, i.e. two different samples of 
200 could easily lead to two different sets of variables being 
selected, especially if a lot of variables are close to each other in 
their correlations.
Your suggested procedure might make this even worse.  Suppose X1 and 
X2 both have 20% missing data and it is a different 20% for each 
variable.  X1 barely edges out X2 in step 1.  The sample for 
subsequent steps will be quite different than it would be if X2 had 
barely edged out X1.
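
To make that concrete, here is a minimal Python sketch with made-up row indices (no real data; the names `missing_x1` and `missing_x2` are hypothetical). It assumes each step uses every observation available for the variables already in the model, and shows that the step-2 sample then depends on which variable won step 1.

```python
# Hypothetical illustration: 200 rows, X1 and X2 each missing on a different 20%.
n = 200
all_rows = set(range(n))
missing_x1 = set(range(0, 40))    # assumed: X1 is missing on rows 0-39
missing_x2 = set(range(40, 80))   # assumed: X2 is missing on rows 40-79

# Rows usable at step 2 if X1 won step 1, versus if X2 won step 1.
sample_if_x1_first = all_rows - missing_x1
sample_if_x2_first = all_rows - missing_x2

shared = sample_if_x1_first & sample_if_x2_first    # rows in both samples
only_one = sample_if_x1_first ^ sample_if_x2_first  # rows in exactly one sample

print(len(sample_if_x1_first), len(sample_if_x2_first))  # 160 160
print(len(shared), len(only_one))                        # 120 80
```

Both samples have 160 rows, but 80 rows belong to only one of them, so every later step would be estimated on noticeably different data depending on a near tie at step 1.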
Anyway, I would say that, if you are really concerned about the 
effects of missing data, then try to do something about it.  If you 
don't want to get too fancy about it, perhaps even simple mean 
substitution (which has its own problems) would be better than 
nothing at all.  See the -impute- command.
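
For what it is worth, simple mean substitution amounts to something like the following Python sketch. The function name `mean_substitute` and the toy values are my own assumptions for illustration; this is not what -impute- does internally. It also shows one of the problems mentioned: the filled-in variable has less variance than the observed values.

```python
from statistics import mean, variance

def mean_substitute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

x = [2.0, None, 4.0, 6.0, None]   # toy data, None marks a missing value
filled = mean_substitute(x)
print(filled)                      # [2.0, 4.0, 4.0, 6.0, 4.0]

# Sample variance before and after filling: the spread is understated.
observed_only = [v for v in x if v is not None]
print(variance(observed_only), variance(filled))  # 4.0 2.0
```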
I assume it would be possible to write a stepwise procedure that 
behaved like you would like, but it might take a long time to 
run.  Regular stepwise can be done just by working off a correlation 
matrix.  With your approach, the correlation matrix would be a moving 
target as the sample changed, so I imagine you'd have to do a lot 
more calculations.
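
As a rough sketch of the "working off a correlation matrix" idea, the Python below runs the first two steps of a forward selection using nothing but correlations. The correlations are made up and the helper names are hypothetical; this is not Stata's implementation. Step 1 takes the predictor with the largest absolute correlation with y, and step 2 takes the remaining predictor with the largest absolute partial correlation given the first.

```python
import math

# Assumed (made-up) correlations among y and three candidate predictors.
r = {
    ("y", "x1"): 0.50, ("y", "x2"): 0.48, ("y", "x3"): 0.30,
    ("x1", "x2"): 0.90, ("x1", "x3"): 0.10, ("x2", "x3"): 0.10,
}

def corr(a, b):
    """Look up a correlation regardless of argument order."""
    return 1.0 if a == b else r.get((a, b), r.get((b, a)))

def partial_corr(y, x, z):
    """First-order partial correlation of y and x, controlling for z."""
    num = corr(y, x) - corr(y, z) * corr(x, z)
    den = math.sqrt((1 - corr(y, z) ** 2) * (1 - corr(x, z) ** 2))
    return num / den

candidates = ["x1", "x2", "x3"]

# Step 1: largest absolute zero-order correlation with y.
first = max(candidates, key=lambda x: abs(corr("y", x)))

# Step 2: largest absolute partial correlation with y, given the first pick.
remaining = [x for x in candidates if x != first]
second = max(remaining, key=lambda x: abs(partial_corr("y", x, first)))

print(first, second)  # x1 x3
```

Every step reuses the same matrix because the sample is fixed. If the sample changed from step to step, the correlations would have to be recomputed from the raw data on each candidate's own set of available observations before every comparison.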
Richard, re-running stepwise on the selected model does not produce the
same results, does it?
Hopefully it would, but there is no guarantee.  If there are big 
differences, this may underscore the problems you have with missing data (MD) or the
problems with using a stepwise procedure where small differences in 
variable correlations could produce very different models.  For 
example, in the full sample, X1 might barely edge out X2, but in a 
sample where MD has been eliminated X2 might barely edge out X1.
-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
FAX:    (574)288-4373
HOME:   (574)289-5227
EMAIL:  [email protected]
WWW (personal):    http://www.nd.edu/~rwilliam
WWW (department):    http://www.nd.edu/~soc 