Stepwise regression with the svy commands
William Sribney, StataCorp
The stepwise prefix command in Stata does not work with svy: logit or any other svy commands. Most search-lots-of-possibilities stepwise procedures are not sound statistically, and most statisticians would not recommend them.
For a list of problems with stepwise procedures, see the FAQ: What are some of the problems with stepwise regression?
To these reasons, let me add that using stepwise methods for cluster-sampled data is even more problematic because the effective degrees of freedom are bounded by the number of clusters. Thus, we have no plans to allow the svy commands to work with the stepwise procedure.
If you do not have a priori hypotheses to test, then model building is really an art. I recommend that you do what I call “planned backward block stepwise regression”. Other people call this “hierarchical stepwise regression”.
. svy: logit y a1 a2 ... b1 b2 ... c1 c2 ... h1 h2 ...
. test h1 h2 ...
Steps 1–4 alone are problematic because of multiple comparisons. One should really do a Bonferroni correction for testing the groups.
That is, if you have K groups of covariates to test, you should use a significance level of 0.05/K. This is a stringent procedure but is the only statistically sound thing to do, in my opinion. Remember that ideally one should be testing M a priori hypotheses each at a level of 0.05/M, so you should not be rewarded for not having a priori hypotheses! But, for the sake of having something to publish, a Bonferroni correction is usually not done.
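As a rough sketch of how the block testing and the 0.05/K level fit together, the do-file fragment below assumes K = 2 blocks ranked from most to least important. It refits the model and tests the last remaining block at the Bonferroni level, dropping the block when it is not jointly significant and stopping at the first block that is. The nhanes2d example dataset, the outcome highbp, the grouping of covariates into blocks, and the stopping rule are all assumptions made for illustration. They are one reading of the procedure, not a definitive implementation.

```stata
* Illustrative only: dataset, outcome, and blocks are assumptions
webuse nhanes2d, clear

local block1 age female            // hypothetical block ranked most important
local block2 height weight         // hypothetical block ranked least important
local nblocks 2
local alpha = 0.05/`nblocks'       // Bonferroni level, 0.05/K

local keep `block1' `block2'
forvalues k = `nblocks'(-1)1 {
    svy: logit highbp `keep'       // refit with the blocks still retained
    test `block`k''                // joint test of the last remaining block
    local p = r(p)
    display "block `k': p = `p', compared with `alpha'"
    if `p' < `alpha' continue, break    // block is significant: keep it and stop
    local keep : list keep - block`k'   // otherwise drop the whole block
}
display "retained covariates: `keep'"
```

Each block is kept or dropped as a whole, so the number of tests is at most K, which is what the 0.05/K level accounts for.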
If you did not have survey data, I would recommend doing the above procedure coupled with a split-sample approach: you divide your sample into two parts, develop a model on one part, and then try to confirm it on the other. (When it does not get confirmed, you will be stuck, so you will make sure that you have a priori hypotheses for the next study you are involved with.)
Splitting up survey data, however, is a dicey proposition if you have only a moderate number of clusters (PSUs) because you should keep clusters whole. Thus you should only split survey data if you have many clusters in each stratum.
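To make "keep clusters whole" concrete, here is a minimal sketch that assigns each PSU, rather than each observation, to one of two halves and then counts how many PSUs each stratum contributes to each half. The nhanes2d example dataset, its variable names strata and psu, the seed, and the simple coin-flip assignment are assumptions for illustration. A real split might instead balance the number of PSUs within each stratum.

```stata
* Illustrative only: dataset, variable names, and seed are assumptions
webuse nhanes2d, clear
set seed 12345

egen long psu_id = group(strata psu)       // one id per PSU
egen byte first  = tag(psu_id)             // flags a single observation per PSU
generate double u = runiform() if first    // one random draw per PSU
bysort psu_id (u): replace u = u[1]        // copy that draw to every observation in the PSU
generate byte half = u < 0.5               // 1 = development half, 0 = confirmation half

* Number of PSUs that each stratum contributes to each half
tabulate strata half if first
```

This example dataset has very few PSUs per stratum, so the table should show strata left with one PSU or none in a given half, which is exactly the situation in which splitting is not advisable.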