Hi Statalist,
After some thought I feel it is unfair to the discussion of stepwise
techniques if I withhold my arguments for the so-called 'alternative'
stepwise method. I don't know if Statalist is the correct place for this,
so if not please let me know. I have not found any faults in Richard's
comments, but I feel that his considerations are not necessarily the most
important in the common situation where sample size is small and the
number of potential predictors is large, which is usually the case when
stepwise regression is needed. Say we have n = 200 and a pool of 50
potential predictors, and say that each of these 50 predictors has 1 or 2
missing values, not necessarily at random. Using the Stata stepwise procedure, we
may well end up with a final model with some 5 variables, but this model
was derived only using around 75% of the sample, and most likely not a
random sample. Would it not be wiser to use all available observations at
each try? Intuitively I feel that this final model might be less biased
because it does not involve throwing as much information away (1% vs 25%),
although I believe mathematically this would be quite difficult to prove.
Richard, re-running stepwise on the selected model does not produce the
same results, does it?
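
To make the arithmetic above concrete, here is a minimal sketch in Python (not Stata). It assumes the simplest pattern consistent with the example: each of the 50 predictors is missing on exactly one row, and no two predictors are missing on the same row. The missingness pattern and the choice of the five "final" variables are assumptions made only for illustration.

```python
N_ROWS = 200
N_PREDICTORS = 50


def complete_rows(missing_by_var, variables, n_rows):
    """Count rows with no missing value on any of the given variables."""
    dropped = set()
    for v in variables:
        dropped |= missing_by_var[v]
    return n_rows - len(dropped)


# Assumption: predictor j is missing on exactly one row (row j), no overlap.
missing_by_var = {j: {j} for j in range(N_PREDICTORS)}

all_vars = list(range(N_PREDICTORS))
final_vars = [0, 1, 2, 3, 4]  # hypothetical five survivors of the selection

# Listwise deletion over the whole candidate pool, as stepwise does.
kept_stepwise = complete_rows(missing_by_var, all_vars, N_ROWS)
# Listwise deletion over the five selected variables only.
kept_final = complete_rows(missing_by_var, final_vars, N_ROWS)

print(kept_stepwise / N_ROWS, kept_final / N_ROWS)
```

Under these assumptions the selection runs on 150 of the 200 rows (75%), while a model with only the five selected variables could use 195 rows. If the missing rows overlap across predictors, fewer rows are lost.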
Tim
Richard Williams <[email protected]>
Sent by: [email protected]
01/09/2006 21:54
Please respond to
[email protected]
To
[email protected]
cc
Subject
Re: st: stepwise
At 10:15 AM 9/1/2006, [email protected] wrote:
>I don't see why you can't work with all available data at each try.
>Arguably there is the down side that you're comparing models with
>different number of observations. But it just bothers me that at the end
>of the day I have a final model that doesn't have the same results as when
>I simply enter the variables. Moreover, if there are lots of variables, we
>may end up running the procedure on only half of the data, which is a bit
>stupid I think. Alan, thanks for the suggestion of multiple imputation,
>but that's not my concern at the moment, because I won't be using it
>simply because it's too complicated. In any case, how do you run stepwise
>regression on several different imputed datasets and decide on one final
>one at the end?
As to why it happens - I believe both stepwise and nestreg do
listwise deletion based on all the vars you have specified. Hence, a
variable with missing data that doesn't make the final cut can still
cause your sample size to be reduced.
I don't know this for a fact, but it wouldn't surprise me if listwise
deletion is a universal or near-universal practice for stepwise
programs. I see your point, but how exactly would the alternative
work? Consider some practical problems:
* Suppose X1 has 100% data but X2 has 50% missing. If you are going
by p values, X1 has a built in advantage just because its N is
larger. This can also be true of all subsequent variables examined;
a larger N gives one variable an advantage over others.
* The formulas used are (or can be) based on comparisons of model
fit, e.g. how much does the addition of one or more variables cause
the residual sums of squares to go down? Such comparisons should be
based on the same sample being used. If the sample size is suddenly
cut in half, you'll see big differences in the residual sums of
squares, but that is primarily because the sample is much smaller
once you add the new variable.
* Suppose data are missing non-randomly. To take an extreme case,
suppose X2 has so much missing data because it was only asked of
females. X1 might enter first in the full sample, but not enter
first (or even at all) in the sample with women only. This point
especially concerns me - while you may be able to come up with
alternative formulas that get around the first 2 problems, this
problem would seem to be there regardless.
* In short, the whole purpose of stepwise is to select variables
based on comparisons of effects. But, it is hard to make and justify
comparisons of variable effects if the sample analyzed is a moving
target as you go from one variable to the next.
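
The first two points can be illustrated with a small Python sketch (not Stata). It uses the standard t statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)), and a made-up outcome vector; the value r = 0.18 and the data are assumptions chosen only to show the effect of N.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing a Pearson correlation r observed on n cases."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def rss_about_mean(y):
    """Residual sum of squares from a constant-only (mean) model."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


# Point 1: the same strength of association looks more significant at larger N.
r = 0.18  # assumed correlation, identical for both variables
t_full = t_for_correlation(r, 200)  # variable observed on all 200 cases
t_half = t_for_correlation(r, 100)  # variable with half of its data missing

# Point 2: residual sums of squares shrink when the sample shrinks, even
# though nothing about the fit has changed.
y = [float(i % 10) for i in range(200)]  # made-up outcome values
rss_full = rss_about_mean(y)
rss_half = rss_about_mean(y[:100])

print(t_full, t_half, rss_full, rss_half)
```

With a two-sided 5% critical value of roughly 1.97 to 1.98 at these degrees of freedom, the variable observed on 200 cases passes and the one observed on 100 cases does not, although the correlation is identical. Likewise the residual sum of squares halves purely because half the rows are gone.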
There would also seem to be workarounds for your concern. If X1 and
X2 were the lucky survivors from stepwise, then just run a new
regression using X1 and X2; or perhaps run a new stepwise using only
those variables. If all is well, you presumably will find that the
coefficients are about the same as they were with the original
stepwise but they are more significant because of the higher N. If
coefficients are quite a bit different, then that may suggest
systematic biases in your missing data.
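
The check described above can be sketched in Python (not Stata) with a single predictor. Everything here is hypothetical: the data, the rows on which the dropped candidate X2 is missing, and the tolerance used to decide whether two coefficients are "about the same".

```python
def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: y depends on x1 with slope 2 plus a small alternating error.
x1 = [float(i) for i in range(12)]
y = [1.0 + 2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(x1)]

# x2 was a candidate that did not survive selection, but it is missing on
# rows 3 and 7, so listwise deletion dropped those rows during stepwise.
x2_missing_rows = {3, 7}
kept = [i for i in range(len(x1)) if i not in x2_missing_rows]

# Coefficient as estimated on the reduced stepwise sample.
slope_stepwise_sample = ols_slope([x1[i] for i in kept], [y[i] for i in kept])
# Coefficient after refitting the surviving variable on all available rows.
slope_all_available = ols_slope(x1, y)

TOLERANCE = 0.05  # arbitrary threshold chosen for this illustration
coefficients_agree = abs(slope_stepwise_sample - slope_all_available) < TOLERANCE
print(slope_stepwise_sample, slope_all_available, coefficients_agree)
```

Here the two slopes differ only slightly, which is the "all is well" case. A large gap between them would be the warning sign that the rows lost to listwise deletion differ systematically from the rest.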
Incidentally, I am ignoring for now all the concerns that can be
raised about stepwise regression! But the alternative you have in
mind, I think, would simply add to those concerns. If anyone has
contrary views or knows of stepwise routines that behave like Timothy
would like, I'd be interested in hearing about them.
-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
FAX: (574)288-4373
HOME: (574)289-5227
EMAIL: [email protected]
WWW (personal): http://www.nd.edu/~rwilliam
WWW (department): http://www.nd.edu/~soc
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/