Title | Using svyset for stratified multiple-stage designs

Author | Jeffrey Pitblado, StataCorp

Date | May 2006; updated July 2011

Suppose you are faced with analyzing data from the following survey design:

The population was sampled by stratifying it first and then randomly selecting several clusters for each stratum. Within each cluster, subclusters were randomly selected, and then for each subcluster individuals were randomly selected.

Your first question when analyzing survey data should always be:

How do I identify the sampling design using svyset in Stata?

Starting in Stata 9, svyset has a syntax to deal with multiple stages of clustered sampling.

Let’s make up some variable names to represent survey design characteristics:

Variable | Description
---|---
pwt | sampling weights
strata1 | stage 1 strata
su1 | stage 1 sampling units (PSU)
fpc1 | stage 1 finite population correction
strata2 | stage 2 strata
su2 | stage 2 sampling units (SSU)
fpc2 | stage 2 finite population correction

... you get the idea.

Given the description above, the svyset command should be structured as follows:

svyset su1 [pw=pwt], strata(strata1) fpc(fpc1) ///
    || su2, fpc(fpc2) || _n, fpc(fpc3)
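To see the command run end to end, the following do-file sketch simulates a small three-stage sample and declares the design. The data are made up: the sample sizes, the population counts stored in fpc1 through fpc3, the outcome y, and the seed are all assumptions chosen only so that the example runs. Here fpc3 is the stage 3 analogue of fpc1 and fpc2.

```stata
* Hypothetical data: 4 strata, 3 PSUs per stratum, 4 SSUs per PSU,
* 5 individuals per SSU (all counts are assumptions for illustration)
clear
set seed 2011
set obs 4
generate strata1 = _n                 // stage 1 strata
expand 3
generate su1 = _n                     // 12 PSUs with unique ids
generate fpc1 = 30                    // assumed PSUs per stratum in the population
expand 4
generate su2 = _n                     // 48 SSUs with unique ids
generate fpc2 = 10                    // assumed SSUs per PSU in the population
expand 5                              // 240 sampled individuals
generate fpc3 = 50                    // assumed individuals per SSU in the population
generate pwt = (30/3)*(10/4)*(50/5)   // inverse of the overall selection probability
generate y = rnormal()                // made-up outcome

* Declare the three-stage design; _n marks individuals as the final-stage units
svyset su1 [pw=pwt], strata(strata1) fpc(fpc1) ///
    || su2, fpc(fpc2) || _n, fpc(fpc3)

svydescribe                           // summary of strata and sampled PSUs
svy: mean y                           // design-based mean and standard error
```

Typing svyset with no arguments afterward redisplays the current settings, which is a quick way to confirm that each stage was recorded as intended.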

Prior to Stata 9, **svyset** accepted only the first-stage
design variables, so one might assume that the **svyset** command
should be as follows:

svyset [pweight=pwt], fpc(fpc1) psu(su1) strata(strata1)

When using only the first-stage design characteristics, you must be aware
that specifying an FPC implies there was no sampling within the PSUs.
If this is not true, then specifying an FPC for the first stage will yield
negatively biased standard errors; that is, the standard error estimates will
be smaller than they should be. In this case, we recommend that you not
**svyset** an FPC.

If we remove the **fpc()** option, then

svyset [pweight=pwt], psu(su1) strata(strata1)

will produce appropriate variance estimates, even for multistage designs.
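The effect of a first-stage FPC can be checked on simulated data. The do-file sketch below is illustrative only: the two-stage sample, the population counts in fpc1 and fpc2, the outcome y, and the scalar names are assumptions. It sets the design three ways and saves the standard error of the mean each time.

```stata
* Hypothetical two-stage sample: 2 strata, 3 of 4 PSUs sampled per stratum,
* 10 of 200 individuals sampled per PSU (all counts are assumptions)
clear
set seed 2011
set obs 2
generate strata1 = _n
expand 3
generate su1 = _n                     // 6 PSUs with unique ids
generate fpc1 = 4                     // assumed PSUs per stratum in the population
generate u = rnormal()                // PSU-level effect shared within a cluster
expand 10                             // 60 sampled individuals
generate fpc2 = 200                   // assumed individuals per PSU in the population
generate pwt = (4/3)*(200/10)         // inverse of the overall selection probability
generate y = u + rnormal()            // made-up outcome

* (a) first stage only, with an FPC: treats each sampled PSU as fully observed
svyset su1 [pw=pwt], strata(strata1) fpc(fpc1)
svy: mean y
matrix V = e(V)
scalar se_fpc = sqrt(V[1,1])

* (b) first stage only, no FPC: the recommended fallback
svyset, clear
svyset su1 [pw=pwt], strata(strata1)
svy: mean y
matrix V = e(V)
scalar se_nofpc = sqrt(V[1,1])

* (c) both stages, each with its own FPC
svyset, clear
svyset su1 [pw=pwt], strata(strata1) fpc(fpc1) || _n, fpc(fpc2)
svy: mean y
matrix V = e(V)
scalar se_full = sqrt(V[1,1])

display "first-stage FPC only: " se_fpc
display "no FPC:               " se_nofpc
display "both stages:          " se_full
```

With only the first stage declared, the FPC scales each stratum's variance contribution by one minus the sampling fraction, so with three of four PSUs sampled in every stratum, the standard error in (a) should come out at about half of that in (b). Specification (c) adds back the within-PSU sampling variability that (a) ignores.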

The previous assertion is also valid if you are using the modern syntax
for **svyset** but can specify only the first-stage
characteristics. For example, some datasets come only with information
on stratification and sampling units for the first stage, even if they
were collected via a multistage design. If this is the case,
**fpc()** should not be used for the reasons explained above.

In Stata 9 and later, you can specify the design variables for each stage,
using **||** to delimit the stages.

Now suppose the design involved cluster sampling first, and then each cluster was stratified before the subclusters were sampled. Here we stratified in the second stage but not the first, so we should have a variable like strata2 instead of strata1:

svyset su1 [pw=pwt], fpc(fpc1) ///
    || su2, strata(strata2) fpc(fpc2) || _n, fpc(fpc3)

If our design involved stratified cluster sampling in both the first and second stages, the svyset command would be as follows:

svyset su1 [pw=pwt], strata(strata1) fpc(fpc1) ///
    || su2, strata(strata2) fpc(fpc2) || _n, fpc(fpc3)
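A minimal do-file sketch of the doubly stratified design on simulated data follows. Everything about the data is assumed for illustration: the sample sizes, the population counts in fpc1 through fpc3, the outcome y, and the seed. The stage 2 strata are given ids that are unique across PSUs, which is a cautious choice here rather than a documented requirement.

```stata
* Hypothetical data: 3 stage 1 strata, 4 PSUs per stratum, 2 stage 2 strata
* per PSU, 3 SSUs per stage 2 stratum, 5 individuals per SSU (all assumed)
clear
set seed 2011
set obs 3
generate strata1 = _n                 // stage 1 strata
expand 4
generate su1 = _n                     // 12 PSUs with unique ids
generate fpc1 = 25                    // assumed PSUs per stage 1 stratum in the population
expand 2
generate strata2 = _n                 // 24 stage 2 strata, 2 within each PSU
expand 3
generate su2 = _n                     // 72 SSUs with unique ids
generate fpc2 = 8                     // assumed SSUs per stage 2 stratum in the population
expand 5                              // 360 sampled individuals
generate fpc3 = 40                    // assumed individuals per SSU in the population
generate pwt = (25/4)*(8/3)*(40/5)    // inverse of the overall selection probability
generate y = rnormal()                // made-up outcome

* Stratified cluster sampling in both the first and second stages
svyset su1 [pw=pwt], strata(strata1) fpc(fpc1) ///
    || su2, strata(strata2) fpc(fpc2) || _n, fpc(fpc3)

svy: mean y                           // design-based mean and standard error
```

Dropping strata(strata1) from the first stage of this sketch gives the design in which only the second stage is stratified.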

In Stata 9 and later, you need to know which stage each strata variable belongs to, because strata() is specified separately for each stage. See [SVY] svyset for more examples of how to svyset multistage designs.

Prior to Stata 9, you would use the strata() option only if your design had stratification in the first stage.