Stata has a number of commands designed to handle the special requirements of complex survey data. The commands handle any or all of the following survey-design features: probability sampling weights, stratification, multiple stages of cluster sampling, and poststratification. There are commands for estimating means, totals, ratios, and proportions, as well as commands for linear regression, logistic regression, probit models, and other estimators adapted to survey sampling designs; see the table below for a complete listing of svy commands.

Variance estimates are produced using one of five variance-estimation techniques: balanced repeated replication, the bootstrap, the jackknife, successive difference replication, and Taylor linearization.
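As a minimal sketch of how the variance method is chosen, the svy prefix accepts the method name before the colon. The example assumes the survey design has already been declared with svyset and that the data contain the variable weight used in the examples on this page. Only linearization and the jackknife are shown, because the replication methods based on replicate weights need design information that is not described here.

```stata
* Assumes the data are in memory and already svyset.

* Taylor linearization (the default when no method is named)
svy linearized: mean weight

* Jackknife variance estimates for the same statistic
svy jackknife: mean weight
```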

The Stata estimation commands designed to handle the special requirements of complex survey data work with the svy prefix:

| Command | Description | Command | Description |
|---|---|---|---|
| svy: biprobit | Bivariate probit regression for survey data | svy: nl | Nonlinear least-squares estimation for survey data |
| svy: clogit | Conditional (fixed-effects) logistic regression for survey data | svy: ologit | Ordered logistic regression for survey data |
| svy: cloglog | Complementary log-log regression for survey data | svy: oprobit | Ordered probit regression for survey data |
| svy: cnsreg | Constrained linear regression for survey data | svy: poisson | Poisson regression for survey data |
| svy: etregress | Linear regression with endogenous treatment effects | svy: probit | Probit regression for survey data |
| svy: glm | Generalized linear models for survey data | svy: proportion | Estimate proportions for survey data |
| svy: gnbreg | Generalized negative binomial regression for survey data | svy: ratio | Estimate ratios for survey data |
| svy: heckman | Heckman selection model for survey data | svy: regress | Linear regression for survey data |
| svy: heckoprobit | Ordered probit model with sample selection for survey data | svy: scobit | Skewed logistic regression for survey data |
| svy: heckprobit | Probit model with sample selection for survey data | svy: sem | Structural equation modeling for survey data |
| svy: hetprobit | Heteroskedastic probit regression for survey data | svy: slogit | Stereotype logistic regression for survey data |
| svy: intreg | Interval regression for survey data | svy: stcox | Cox proportional hazards model for survey data |
| svy: ivprobit | Probit model with endogenous regressors for survey data | svy: streg | Parametric survival models for survey data |
| svy: ivregress | Single-equation instrumental-variables regression for survey data | svy: tnbreg | Truncated negative binomial regression for survey data |
| svy: ivtobit | Tobit model with endogenous regressors for survey data | svy: tobit | Tobit regression for survey data |
| svy: logistic | Logistic regression for survey data, reporting odds ratios | svy: total | Estimate totals for survey data |
| svy: logit | Logistic regression for survey data, reporting coefficients | svy: tpoisson | Truncated Poisson regression for survey data |
| svy: mean | Estimate means for survey data | svy: truncreg | Truncated regression for survey data |
| svy: mlogit | Multinomial (polytomous) logistic regression for survey data | svy: zinb | Zero-inflated negative binomial regression for survey data |
| svy: mprobit | Multinomial probit regression for survey data | svy: zip | Zero-inflated Poisson regression for survey data |
| svy: nbreg | Negative binomial regression for survey data | | |

Many other estimation commands in Stata also have features that make them suitable for certain limited survey designs. For example, Stata’s competing-risks regression routine (stcrreg) handles sampling weights properly when sampling weights are specified, and it also handles clustering.

Stata's mixed command for fitting multilevel linear models allows for both sampling weights and clustering. Sampling weights may be specified at all levels in your multilevel model, and thus, by necessity, weights need to be treated differently in mixed than in other estimation commands. Some caution on the part of the user is required; see the documentation for mixed.

estat effects computes the design effects DEFF and DEFT, as well as the misspecification effects MEFF and MEFT. The test command, used after a svy estimation command, computes adjusted Wald tests and Bonferroni tests for linear hypotheses (single or joint).
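A short sketch of test after a svy estimation command follows. It assumes the data are already svyset and borrows the variable names highbp, female, and black from the logistic example later on this page; the particular hypotheses tested are illustrative choices, not taken from the original.

```stata
* Assumes the data are in memory and already svyset.
svy: logit highbp height weight age female black

* Joint adjusted Wald test that both coefficients are zero
test female black

* The same hypotheses tested individually with Bonferroni-adjusted p-values
test female black, mtest(bonferroni)
```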

Here is an example of the use of the svy: mean command:

| | Mean | Linearized Std. Err. | [95% Conf. | Interval] |
|---|---|---|---|---|
| weight | 71.90064 | .1654434 | 71.56321 | 72.23806 |
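The commands that produce output of this kind might look like the sketch below. The dataset name (nhanes2f, one of Stata's NHANES II example files) and the design variable names psuid, stratid, and finalwgt are assumptions; only the analysis variable weight comes from the output shown.

```stata
* Assumed example data; substitute your own survey dataset.
webuse nhanes2f, clear

* Declare the design once: PSU identifier, sampling weight, and strata.
* psuid, finalwgt, and stratid are assumed variable names.
svyset psuid [pweight=finalwgt], strata(stratid)

* Design-based estimate of the population mean of weight
svy: mean weight
```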

The svyset command, illustrated above, allows you to set the variables that contain the sampling weights, strata, and any PSU identifiers at the outset. These variables are remembered for subsequent commands and do not have to be reentered.

Estimating the difference between two subpopulation means can be done by running svy: mean with an over() option to produce subpopulation estimates and then running the command lincom:

| Over | Mean | Linearized Std. Err. | [95% Conf. | Interval] |
|---|---|---|---|---|
| weight | | | | |
| Male | 78.62789 | .2097761 | 78.20004 | 79.05573 |
| Female | 65.70701 | .266384 | 65.16372 | 66.25031 |
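A hedged sketch of the two steps follows. It assumes the data are already svyset and that the grouping variable is named sex with value labels Male and Female (the variable name is an assumption; the labels appear in the output above). The lincom expression uses the coefficient naming that matches this Stata 13-style output; other releases name the coefficients differently, so the sketch lists the names first.

```stata
* Assumes the data are in memory and already svyset.
* sex is an assumed variable name.
svy: mean weight, over(sex)

* Inspect the coefficient names before writing the lincom expression,
* because they differ across Stata releases.
matrix list e(b)

* Stata 13-style names; adapt to whatever e(b) shows in your release.
lincom [weight]Male - [weight]Female
```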

The svy: mean, svy: proportion, svy: ratio, and svy: total commands produce estimates for multiple subpopulations:

| Over | Mean | Linearized Std. Err. | [95% Conf. | Interval] |
|---|---|---|---|---|
| weight | | | | |
| _subpop_1 | 78.98862 | .2125203 | 78.55518 | 79.42206 |
| _subpop_2 | 78.324 | .8476215 | 76.59526 | 80.05273 |
| _subpop_3 | 68.16404 | 1.811668 | 64.46912 | 71.85896 |
| _subpop_4 | 65.10844 | .2926873 | 64.5115 | 65.70538 |
| _subpop_5 | 72.38252 | 1.059851 | 70.22094 | 74.5441 |
| _subpop_6 | 59.56941 | 1.325068 | 56.86692 | 62.27191 |
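Six subpopulations arise naturally from crossing a two-level variable with a three-level one. A minimal sketch, assuming the data are already svyset and that sex and race are the (hypothetical) grouping variable names:

```stata
* Assumes the data are in memory and already svyset.
* sex and race are assumed variable names.
svy: mean weight, over(sex race)
```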

Use estat effects to report DEFF and DEFT.

| Over | Mean | Linearized Std. Err. | DEFF | DEFT |
|---|---|---|---|---|
| weight | | | | |
| _subpop_1 | 78.98862 | .2125203 | 1.15287 | 1.07372 |
| _subpop_2 | 78.324 | .8476215 | 1.34608 | 1.16021 |
| _subpop_3 | 68.16404 | 1.811668 | 2.08964 | 1.44556 |
| _subpop_4 | 65.10844 | .2926873 | 2.09219 | 1.44644 |
| _subpop_5 | 72.38252 | 1.059851 | 1.93387 | 1.39064 |
| _subpop_6 | 59.56941 | 1.325068 | 1.55682 | 1.24772 |
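estat effects is a postestimation command, so it is run after the svy estimation whose design effects you want. A sketch under the same assumptions as before (data already svyset; sex and race are assumed variable names):

```stata
* Assumes the data are in memory and already svyset.
svy: mean weight, over(sex race)

* Report DEFF and DEFT for each subpopulation mean
estat effects
```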

Use estat size to report the number of observations belonging to each subpopulation and estimates of the subpopulation size.

| Over | Mean | Linearized Std. Err. | Obs | Size |
|---|---|---|---|---|
| weight | | | | |
| _subpop_1 | 78.98862 | .2125203 | 4312 | 49504800 |
| _subpop_2 | 78.324 | .8476215 | 500 | 5096044 |
| _subpop_3 | 68.16404 | 1.811668 | 103 | 1558636 |
| _subpop_4 | 65.10844 | .2926873 | 4753 | 53494749 |
| _subpop_5 | 72.38252 | 1.059851 | 586 | 6093192 |
| _subpop_6 | 59.56941 | 1.325068 | 97 | 1410092 |
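Like estat effects, estat size works on the most recent svy estimation results. A sketch, again assuming the data are already svyset and that sex and race are the grouping variable names:

```stata
* Assumes the data are in memory and already svyset.
svy: mean weight, over(sex race)

* Report the number of observations and the estimated size of each subpopulation
estat size
```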

You can fit linear regressions, logistic regressions, and probit models using svy estimators. Shown below is an example of svy: logit, which fits logistic regressions for survey data.

| highbp | Coef. | Linearized Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| height | -.0345643 | .0053121 | -6.51 | 0.000 | -.0453985 | -.0237301 |
| weight | .051004 | .0025292 | 20.17 | 0.000 | .0458457 | .0561622 |
| age | .0554544 | .0127859 | 4.34 | 0.000 | .0293774 | .0815314 |
| c.age#c.age | -.0000676 | .0001385 | -0.49 | 0.629 | -.0003502 | .0002149 |
| female | -.4758698 | .0561318 | -8.48 | 0.000 | -.5903513 | -.3613882 |
| black | .338201 | .1075191 | 3.15 | 0.004 | .1189143 | .5574877 |
| _cons | -.5140351 | .8747001 | -0.59 | 0.561 | -2.297998 | 1.269928 |
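The model specification can be read off the coefficient table: highbp is the outcome, and the rows give the covariates, including the quadratic age term c.age#c.age. A sketch of the corresponding command, assuming the data are already svyset:

```stata
* Assumes the data are in memory and already svyset.
* Variable names are taken from the coefficient table above.
svy: logit highbp height weight age c.age#c.age female black
```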

svy: logit can display estimates as coefficients or as odds ratios. Below we redisplay the previous model, requesting that the estimates be expressed as odds ratios.

| highbp | Odds Ratio | Linearized Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| height | .9660262 | .0051317 | -6.51 | 0.000 | .9556166 | .9765492 |
| weight | 1.052327 | .0026615 | 20.17 | 0.000 | 1.046913 | 1.057769 |
| age | 1.057021 | .013515 | 4.34 | 0.000 | 1.029813 | 1.084947 |
| c.age#c.age | .9999324 | .0001385 | -0.49 | 0.629 | .9996499 | 1.000215 |
| female | .6213444 | .0348772 | -8.48 | 0.000 | .5541326 | .6967085 |
| black | 1.402422 | .1507872 | 3.15 | 0.004 | 1.126273 | 1.74628 |
| _cons | .5980774 | .5231384 | -0.59 | 0.561 | .1004598 | 3.560595 |
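Each odds ratio is the exponentiated coefficient from the previous table; for example, exp(-.0345643) ≈ .9660262 for height. One way to obtain this display, sketched here by refitting the same model with the or option rather than replaying stored results, and assuming the data are already svyset:

```stata
* Assumes the data are in memory and already svyset.
* The or option reports exponentiated coefficients (odds ratios).
svy: logit highbp height weight age c.age#c.age female black, or
```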

After running a logistic regression, you can use lincom to compute odds ratios for any covariate group relative to another.

| highbp | Odds Ratio | Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| (1) | .8713873 | .1233177 | -0.97 | 0.338 | .6529215 | 1.162951 |
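The command behind this output is not shown, but the estimate is consistent with the sum of the female and black coefficients from the fitted model: exp(-.4758698 + .338201) ≈ .8714. Under that inference, a sketch of the call, run right after the logistic model above, would be:

```stata
* Assumes the svy: logit model with female and black has just been fit.
* Odds ratio for black females relative to nonblack males,
* holding the other covariates fixed.
lincom female + black, or
```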

You can also fit linear regressions, logistic regressions, and probit models for a subpopulation:

| highbp | Odds Ratio | Linearized Std. Err. | t | P>\|t\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| age | 1.060226 | .0047619 | 13.02 | 0.000 | 1.050546 | 1.069996 |
| female | .8280475 | .1063299 | -1.47 | 0.152 | .6370331 | 1.076338 |
| _cons | .0791591 | .0185411 | -10.83 | 0.000 | .0490631 | .1277163 |
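Subpopulation models are requested with the subpop() option of the svy prefix rather than with an if qualifier, so that observations outside the subpopulation still contribute to the variance calculation. In the sketch below, using black as the subpopulation indicator is an assumption; the outcome and covariates come from the table above, and the data are assumed to be svyset already.

```stata
* Assumes the data are in memory and already svyset.
* black (a 0/1 indicator) is an assumed choice of subpopulation.
svy, subpop(black): logistic highbp age female
```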

Survey data require some special data management. The svydescribe command can be used to examine the design structure of the dataset. It can also be used to see the number of missing and nonmissing observations per stratum (or optionally per stage) for one or more variables.

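| Stratum | #Units included | #Units omitted | #Obs with complete data | #Obs with missing data | #Obs per included unit: min | mean | max |
|---|---|---|---|---|---|---|---|
| 1 | 1* | 1 | 114 | 266 | 114 | 114.0 | 114 |
| 2 | 1* | 1 | 98 | 87 | 98 | 98.0 | 98 |
| 3 | 2 | 0 | 277 | 71 | 116 | 138.5 | 161 |
| 4 | 2 | 0 | 340 | 120 | 160 | 170.0 | 180 |
| 5 | 2 | 0 | 173 | 79 | 81 | 86.5 | 92 |
| 6 | 2 | 0 | 255 | 43 | 116 | 127.5 | 139 |
| 7 | 2 | 0 | 409 | 67 | 191 | 204.5 | 218 |
| 8 | 2 | 0 | 299 | 39 | 129 | 149.5 | 170 |
| 9 | 2 | 0 | 218 | 26 | 85 | 109.0 | 133 |
| 10 | 2 | 0 | 233 | 29 | 103 | 116.5 | 130 |
| 11 | 2 | 0 | 238 | 37 | 97 | 119.0 | 141 |
| 12 | 2 | 0 | 275 | 39 | 121 | 137.5 | 154 |
| 13 | 2 | 0 | 297 | 45 | 123 | 148.5 | 174 |
| 14 | 2 | 0 | 355 | 50 | 167 | 177.5 | 188 |
| 15 | 2 | 0 | 329 | 51 | 151 | 164.5 | 178 |
| 16 | 2 | 0 | 280 | 56 | 134 | 140.0 | 146 |
| 17 | 2 | 0 | 352 | 41 | 155 | 176.0 | 197 |
| 18 | 2 | 0 | 335 | 24 | 135 | 167.5 | 200 |
| 20 | 2 | 0 | 240 | 45 | 95 | 120.0 | 145 |
| 21 | 2 | 0 | 198 | 16 | 91 | 99.0 | 107 |
| 22 | 2 | 0 | 263 | 38 | 116 | 131.5 | 147 |
| 23 | 2 | 0 | 304 | 37 | 143 | 152.0 | 161 |
| 24 | 2 | 0 | 388 | 50 | 182 | 194.0 | 206 |
| 25 | 2 | 0 | 239 | 17 | 106 | 119.5 | 133 |
| 26 | 2 | 0 | 240 | 21 | 119 | 120.0 | 121 |
| 27 | 2 | 0 | 259 | 24 | 127 | 129.5 | 132 |
| 28 | 2 | 0 | 284 | 15 | 131 | 142.0 | 153 |
| 29 | 2 | 0 | 440 | 63 | 193 | 220.0 | 247 |
| 30 | 2 | 0 | 326 | 39 | 147 | 163.0 | 179 |
| 31 | 2 | 0 | 279 | 29 | 121 | 139.5 | 158 |
| 32 | 2 | 0 | 383 | 67 | 180 | 191.5 | 203 |
| 31 (total strata) | 60 | 2 | 8720 | 1631 | 81 | 145.3 | 247 |

Total observations (complete plus missing): 10351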
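A sketch of how svydescribe might be called to produce a per-stratum summary of this kind follows. It assumes the data are already svyset; the variable name hdresult is an assumption standing in for whichever variable's missing values are being examined.

```stata
* Assumes the data are in memory and already svyset.

* Describe the design structure alone
svydescribe

* Count complete and missing observations per stratum for one variable.
* hdresult is an assumed variable name; substitute your own.
svydescribe hdresult

* Restrict the listing to strata left with a single sampling unit
svydescribe hdresult, single
```

Strata that end up with a single sampling unit after missing values are dropped matter because stratum-level variance contributions cannot be computed from one unit.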

See **New in Stata 13** for more about what was added in Stata 13.