
Is there a way in Stata to do stepwise regression with svy: logit or any of the svy commands?

Title:  Stepwise regression with the svy commands
Author: William Sribney, StataCorp

The stepwise prefix command in Stata does not work with svy: logit or any of the other svy commands. Most search-lots-of-possibilities stepwise procedures are not statistically sound, and most statisticians would not recommend them.

For a list of problems with stepwise procedures, see the FAQ: What are some of the problems with stepwise regression?

To these reasons, let me add that using stepwise methods with cluster-sampled data is even more problematic because the effective number of degrees of freedom is bounded by the number of clusters. Thus we have no plans to allow the stepwise prefix to work with the svy commands.

If you do not have a priori hypotheses to test, then model building is really an art. I recommend that you do what I call “planned backward block stepwise regression”. Other people call this “hierarchical stepwise regression”.

That is,

  1. Arrange your covariates into logical groupings. I will call the groupings {a1, a2, ...}, {b1, b2, ...}, {c1, c2, ...}, .... Order the groupings so that those you think a priori are least important come last.
  2. Run your full model. E.g.,
     . svy: logit y a1 a2 ... b1 b2 ... c1 c2 ... h1 h2 ...  
  3. Test the last group (the least important):
     . test h1 h2 ...  
    If it is not significant, discard the whole group. If it is significant, keep the whole group.
  4. Then test the second-to-last group, etc.
  5. When you have tested all the groups and kept only the significant ones, repeat the same procedure on the individual covariates within the remaining groups. This last step should be considered optional for two reasons. First, it may not make sense to split up the covariates in a group (e.g., they may be dummies for a categorical variable). Second, performing yet more tests is not a good thing. But people usually cannot stand leaving nonsignificant terms in their “final” model. However, overfitting is better than overtesting! (A Stata sketch of steps 2–4 appears just after this list.)
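
As a rough illustration of steps 2–4, here is a minimal sketch in Stata. The outcome y, the blocks {a1, a2}, {b1, b2}, {h1, h2}, and the design variables psuid, stratid, and finalwt are hypothetical placeholders; substitute your own svyset specification and covariates.

     . * declare the survey design (placeholder variable names)
     . svyset psuid [pweight=finalwt], strata(stratid)

     . * step 2: fit the full model with all blocks
     . svy: logit y a1 a2 b1 b2 h1 h2

     . * step 3: test the least important block as a group
     . test h1 h2

     . * if the block is not significant, drop it and refit
     . svy: logit y a1 a2 b1 b2

     . * step 4: test the next block, and so on
     . test b1 b2

If a block consists of the indicators for a categorical variable entered with factor-variable notation, testparm (e.g., testparm i.region) is a convenient way to test the whole block at once.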

Steps 1–4 alone are problematic because of multiple comparisons. One should really do a Bonferroni correction for testing the groups.

That is, if you have K groups of covariates to test, you should use a significance level of 0.05/K. This is a stringent procedure but is the only statistically sound thing to do, in my opinion. Remember that ideally one should be testing M a priori hypotheses each at a level of 0.05/M, so you should not be rewarded for not having a priori hypotheses! But, for the sake of having something to publish, a Bonferroni correction is usually not done.
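
For instance, with K = 5 groups of covariates, each group test in step 3 would be judged against 0.05/5 = 0.01 rather than 0.05. The cutoff is easy to compute on the fly:

     . display 0.05/5
     .01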

If you did not have survey data, I would recommend doing the above procedure coupled with a split sample approach: you divide your sample into two parts, develop a model on one part, and then try to confirm it on the other. (When it does not get confirmed, you will be stuck, so you will make sure that you have a priori hypotheses for the next study you are involved with.)
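
For data that are not from a complex survey, the mechanics of such a split might look like the sketch below; the seed, the 50/50 proportion, and the variable names are arbitrary choices for illustration.

     . * randomly assign each observation to a development or a confirmation half
     . set seed 12345
     . gen byte develop = runiform() < .5

     . * build the model on the development half
     . logit y a1 a2 b1 b2 if develop

     . * try to confirm the chosen model (say only the a block survived) on the other half
     . logit y a1 a2 if !develop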

Splitting up survey data, however, is a dicey proposition if you have only a moderate number of clusters (PSUs) because you should keep clusters whole. Thus you should only split survey data if you have many clusters in each stratum.
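
If you do have many clusters per stratum and decide to split anyway, do the split at the PSU level so that clusters stay whole. Here is a minimal sketch, again with placeholder design variables stratid and psuid; fitting with subpop() keeps the full design information in both analyses.

     . * draw one random number per PSU and spread it to all of the PSU's observations
     . set seed 6789
     . bysort stratid psuid: gen double u = runiform() if _n == 1
     . bysort stratid psuid (u): replace u = u[1]

     . * assign whole PSUs to the development or the confirmation half
     . gen byte develop = u < .5
     . gen byte confirm = !develop

     . * develop the model on one half, then try to confirm it on the other
     . svy, subpop(develop): logit y a1 a2 b1 b2
     . svy, subpop(confirm): logit y a1 a2 b1 b2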