Is there a way in Stata to do stepwise regression with svy: logit or
any of the svy commands?
Title:  Stepwise regression with the svy commands
Author: William Sribney, StataCorp
Date:   May 1998; updated July 2009
The stepwise prefix command in Stata does not work with svy: logit or any other svy command.
Most search-lots-of-possibilities stepwise procedures are not sound
statistically, and most statisticians would not recommend them.
For a list of problems with stepwise procedures, see the FAQ:
What are some of the problems with stepwise regression?
To these reasons, let me add that using stepwise methods for cluster-sampled
data is even more problematic because the effective degrees of freedom are bounded
by the number of clusters. Thus we have no plans to allow the
svy commands to work with the stepwise
procedure.
If you do not have a priori hypotheses to test, then model
building is really an art. I recommend that you do what I call
“planned backward block stepwise regression”. Other people
call this “hierarchical stepwise regression”.
That is,

1. Arrange your covariates into logical groupings. I will call the
   groupings {a1, a2, ...}, {b1, b2, ...}, {c1, c2, ...}, .... Order the
   groupings so that the ones that you think a priori are least
   important are last.

2. Run your full model, e.g.,

       . svy: logit y a1 a2 ... b1 b2 ... c1 c2 ... h1 h2 ...

3. Test the last group (the least important): if it is not significant,
   discard the whole group; if it is significant, keep the whole group.

4. Then test the second-to-last group, and so on.

5. When you have tested all the groups and have kept only the significant
   ones, do the same procedure with each covariate. This last step should
   be considered optional for two reasons. First, it may not make sense
   to split up the covariates in a group (e.g., they may be dummies for
   a categorical variable). Second, performing yet more tests is not a
   good thing. But people usually cannot stand leaving nonsignificant
   terms in their “final” model. However, overfitting is
   better than overtesting!
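The backward block procedure above can be sketched in Stata as follows. This is only an illustration: it assumes svyset has already been declared, a binary outcome y, and hypothetical covariate groups a1 a2, b1 b2, and c1 c2 (all variable names are placeholders).

```stata
* Assumes the survey design is already declared, e.g.
* . svyset psu [pweight=wt], strata(stratum)

* Fit the full model with all covariate groups
svy: logit y a1 a2 b1 b2 c1 c2

* Jointly test the least important group (here c1 c2)
test c1 c2

* If the joint test is not significant, drop the whole group and refit
svy: logit y a1 a2 b1 b2

* Then jointly test the next group (b1 b2) in the reduced model, etc.
test b1 b2
```

The joint Wald test from test (or testparm for factor-variable groups) is what decides whether a whole block stays or goes; individual coefficients are not examined until the optional final step.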
Steps 1–4 alone are problematic because of multiple comparisons. One
should really do a Bonferroni correction for testing the groups.
That is, if you have K groups of covariates to test, you should use a
significance level of 0.05/K. This is a stringent procedure but is the only
statistically sound thing to do, in my opinion. Remember that ideally one
should be testing M a priori hypotheses each at a level of 0.05/M, so
you should not be rewarded for not having a priori hypotheses!
But, for the sake of having something to publish, a Bonferroni correction is
usually not done. (Cheating is OK if everyone else does it, too, right?)
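To make the Bonferroni correction concrete, here is a small sketch, again with placeholder variable names: with K groups, each joint test is compared against 0.05/K, using the p-value that test leaves behind in r(p).

```stata
* Bonferroni correction: with K = 3 covariate groups, judge each
* joint test at the 0.05/3 ≈ 0.0167 level rather than 0.05
local K = 3
local alpha = 0.05/`K'
display "adjusted significance level = " `alpha'

svy: logit y a1 a2 b1 b2 c1 c2
test c1 c2
* keep the group only if its joint p-value beats the adjusted level
display "keep this group? " (r(p) < `alpha')
```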
If you did not have survey data, I would recommend doing the above
procedure coupled with a split sample approach: you divide your sample into
two parts, develop a model on one part, and then try to confirm it on the
other. (When it does not get confirmed, you will be stuck, so
you will make sure that you have a priori hypotheses for the next
study you are involved with.)
Splitting up survey data, however, is a dicey proposition if you have only a
moderate number of clusters (PSUs) because you should keep clusters whole.
Thus you should only split survey data if you have many clusters in each
stratum.
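If you do have many clusters per stratum and want to attempt such a split, one way to keep PSUs whole is to randomize at the PSU level rather than the observation level. A sketch, assuming placeholder variables stratum and psu identify the design (this yields roughly, not exactly, half the PSUs on each side):

```stata
* Randomly assign whole PSUs to one of two half-samples
set seed 12345
egen psutag = tag(stratum psu)            // flag one observation per PSU
generate double u = runiform() if psutag  // one random draw per PSU
bysort stratum psu (u): replace u = u[1]  // copy that draw to the whole PSU
generate byte half = (u < 0.5)            // every PSU stays intact
* develop the model on half==0, then try to confirm it on half==1
```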