Statalist The Stata Listserver



Re: st: stepwise


From   Richard Williams <Richard.A.Williams.5@ND.Edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: stepwise
Date   Mon, 04 Sep 2006 11:17:35 -0400

At 10:19 AM 9/4/2006, Timothy.Mak@iop.kcl.ac.uk wrote:
> Hi Richard,
>
> Do you know how the SPSS pairwise procedure works? I don't think it works
> the way I wanted it to work.

Pairwise is one of the options on the SPSS regression command. It computes the correlations using pairwise deletion and then does the regression calculations from those correlations. Pairwise deletion isn't too bad if data are missing randomly and the missing data are scattered across cases. However, nonrandom missing data can be a problem. Pairwise deletion and other commonly used (and misused) techniques are discussed in my handout at

http://www.nd.edu/~rwilliam/stats2/l12.pdf
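
To make the idea of pairwise deletion concrete, here is a minimal sketch in Python (not SPSS or Stata output). The data and the names x1, x2, y, pearson and pairwise_n_and_r are made up for illustration. Each correlation uses every case where both variables are observed, so different correlations can rest on different subsets of cases, whereas listwise deletion keeps only the fully complete cases.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with no missing values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def pairwise_n_and_r(a, b):
    """Correlate a and b using every case where BOTH are observed (None = missing)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    xs, ys = zip(*pairs)
    return len(pairs), pearson(xs, ys)

# Hypothetical data: six cases, with missing values scattered across variables.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, None, 3.5, 4.0, None, 7.0],
    "y":  [1.5, 2.5, None, 4.5, 5.0, 6.5],
}
names = list(data)

# Listwise deletion: keep only cases with no missing value on any variable.
complete = [i for i in range(len(data["y"]))
            if all(data[k][i] is not None for k in names)]
listwise_n = len(complete)

# Pairwise deletion: each correlation gets its own (possibly larger) set of cases.
pairwise_n = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        n, r = pairwise_n_and_r(data[a], data[b])
        pairwise_n[(a, b)] = n
        print(f"r({a},{b}) = {r:.3f} based on n = {n}")

print(f"listwise n = {listwise_n}")
```

Note that the correlations in the pairwise matrix come from different numbers of cases. That is harmless when the missingness is random, but it is where the trouble starts when it is not.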


> Now I'm really curious about why you suggested using a lower cut-off than
> .05. In fact I was going to use 0.15, as suggested in Hosmer and Lemeshow.

Those two may know more than I do! But the concern is that, with stepwise, vars can enter the equation just by chance. So, suppose your final model has X1-X3, and you make a big deal about your profound discovery of the importance of these vars. But if you have 50 candidate vars, none of which has any real effect, and alpha = .05, you would expect 2 or 3 of them (50 x .05 = 2.5) to enter the equation by chance alone. The problem is compounded if you don't tell people you used stepwise and make it sound like it was your great theory that identified those 3 winners!

The counter-argument, I guess, is that you want to increase the likelihood that important controls are included. However, keep in mind that if you have 50 vars and alpha = .15, then 7 or 8 of them (50 x .15 = 7.5) could make it in by chance alone.
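
The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a rough sketch. It assumes all 50 candidate vars are pure noise and treats each one as an independent test at level alpha, which is a simplification of what a stepwise procedure actually does. The function names are made up for illustration.

```python
def expected_chance_entries(n_vars, alpha):
    """Expected number of pure-noise variables significant at level alpha,
    treating each variable as one independent test (a simplification)."""
    return n_vars * alpha

def prob_at_least_one(n_vars, alpha):
    """Probability that at least one pure-noise variable is significant,
    under the same independence assumption."""
    return 1 - (1 - alpha) ** n_vars

n_vars = 50
for alpha in (0.05, 0.15):
    e = expected_chance_entries(n_vars, alpha)
    p = prob_at_least_one(n_vars, alpha)
    print(f"alpha = {alpha:.2f}: expected chance entries = {e:.1f}, "
          f"P(at least one) = {p:.3f}")
```

Under these assumptions, getting at least one spurious "winner" is close to a sure thing at either cut-off. The choice of alpha mostly changes how many you should expect.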

My brief handout on stepwise is at

http://www.nd.edu/~rwilliam/stats1/x95.pdf

One other followup on what I said before:

I don't know about other fields, but in the Social Sciences it is quite common to ask several questions that all tap the same underlying attitude, e.g. there might be 6 questions that measure self-efficacy, another 6 questions that tap political liberalism, etc. If you try to include all these variables in a regression you have a problem because you're basically including the same variable operationalized in 6 different ways. But, choosing only one of the 6 can be a problem too, since it isn't obvious which question is the "best." So typically, you would use factor analysis or some other scale construction technique; and besides having fewer vars, if done right the resulting scale should be more reliable than the individual measures were.

I don't know about this particular data set, but I'd be a little surprised if X1-X50 measured 50 unique concepts. Ergo, I'd be tempted to try scale construction before I'd use stepwise to make the fine-line distinction between X1 and X2.

For a very brief discussion, see

http://www.nd.edu/~rwilliam/stats2/l25.pdf
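
As a rough illustration of scale construction, the sketch below averages three hypothetical survey items (q1-q3, made-up responses from six people) into one scale score per respondent and computes Cronbach's alpha as a reliability check. This is the simplest additive-scale approach, not factor analysis, and the data and function names are assumptions for the example only.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of responses
    (same respondents, same order, no missing values)."""
    k = len(items)
    item_vars = sum(variance(item) for item in items)
    totals = [sum(resp) for resp in zip(*items)]  # summed score per respondent
    return (k / (k - 1)) * (1 - item_vars / variance(totals))

def mean_scale(items):
    """One scale score per respondent: the mean of that respondent's items."""
    return [sum(resp) / len(resp) for resp in zip(*items)]

# Hypothetical responses (1-5) from six respondents to three questions
# that are all meant to tap the same underlying attitude.
q1 = [1, 2, 3, 4, 5, 4]
q2 = [2, 2, 3, 5, 4, 4]
q3 = [1, 3, 3, 4, 5, 5]
items = [q1, q2, q3]

scale = mean_scale(items)
alpha_value = cronbach_alpha(items)

print("scale scores:", [round(s, 2) for s in scale])
print(f"Cronbach's alpha = {alpha_value:.3f}")
```

The single scale variable would then go into the regression in place of the three separate items.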


-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
FAX: (574)288-4373
HOME: (574)289-5227
EMAIL: Richard.A.Williams.5@ND.Edu
WWW (personal): http://www.nd.edu/~rwilliam
WWW (department): http://www.nd.edu/~soc