
Survey commands

Stata has a number of commands designed to handle the special requirements of complex survey data. The commands handle any or all of the following survey-design features: probability sampling weights, stratification, multiple stages of cluster sampling, and poststratification. There are commands for estimating means, totals, ratios, and proportions, as well as commands for linear regression, logistic regression, probit models, and other survey estimators; see the table below for a complete listing of svy commands.

Variance estimates are produced using one of five variance estimation techniques: balanced repeated replication, the bootstrap, the jackknife, successive difference replication, and Taylor linearization.
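As a rough sketch of how a variance method might be selected, the svy prefix accepts a vcetype, and svyset's vce() option stores a default for later commands. The dataset and variable names are borrowed from the nhanes2 example used further down this page, and the jackknife is chosen purely for illustration. The bootstrap and successive difference replication additionally expect replicate-weight variables to be declared with svyset, which this sketch does not cover.

```stata
* Sketch: choosing a variance estimator (illustrative only)
webuse nhanes2, clear
svyset psu [pw=finalwgt], strata(strata)    // default VCE is linearized (Taylor)

svy: mean weight                            // Taylor-linearized standard errors
svy jackknife: mean weight                  // jackknife standard errors for this command only

* Alternatively, store the jackknife as the default for subsequent svy commands
svyset psu [pw=finalwgt], strata(strata) vce(jackknife)
svy: mean weight
```

The point estimate is the same in each run; only the standard errors and confidence intervals depend on the variance method.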

The Stata estimation commands designed to handle the special requirements of complex survey data work with the svy prefix:

svy: biprobit     Bivariate probit regression for survey data
svy: clogit       Conditional (fixed-effects) logistic regression for survey data
svy: cloglog      Complementary log-log regression for survey data
svy: cnsreg       Constrained linear regression for survey data
svy: etregress    Linear regression with endogenous treatment effects
svy: glm          Generalized linear models for survey data
svy: gnbreg       Generalized negative binomial regression for survey data
svy: heckman      Heckman selection model for survey data
svy: heckoprobit  Ordered probit model with sample selection for survey data
svy: heckprobit   Probit model with sample selection for survey data
svy: hetprobit    Heteroskedastic probit regression for survey data
svy: intreg       Interval regression for survey data
svy: ivprobit     Probit model with endogenous regressors for survey data
svy: ivregress    Single-equation instrumental-variables regression for survey data
svy: ivtobit      Tobit model with endogenous regressors for survey data
svy: logistic     Logistic regression for survey data, reporting odds ratios
svy: logit        Logistic regression for survey data, reporting coefficients
svy: mean         Estimate means for survey data
svy: mlogit       Multinomial (polytomous) logistic regression for survey data
svy: mprobit      Multinomial probit regression for survey data
svy: nbreg        Negative binomial regression for survey data
svy: nl           Nonlinear least-squares estimation for survey data
svy: ologit       Ordered logistic regression for survey data
svy: oprobit      Ordered probit regression for survey data
svy: poisson      Poisson regression for survey data
svy: probit       Probit regression for survey data
svy: proportion   Estimate proportions for survey data
svy: ratio        Estimate ratios for survey data
svy: regress      Linear regression for survey data
svy: scobit       Skewed logistic regression for survey data
svy: sem          Structural equation modeling for survey data
svy: slogit       Stereotype logistic regression for survey data
svy: stcox        Cox proportional hazards model for survey data
svy: streg        Parametric survival models for survey data
svy: tnbreg       Truncated negative binomial regression for survey data
svy: tobit        Tobit regression for survey data
svy: total        Estimate totals for survey data
svy: tpoisson     Truncated Poisson regression for survey data
svy: truncreg     Truncated regression for survey data
svy: zinb         Zero-inflated negative binomial regression for survey data
svy: zip          Zero-inflated Poisson regression for survey data

Many other estimation commands in Stata also have features that make them suitable for certain limited survey designs. For example, Stata’s competing-risks regression routine (stcrreg) handles sampling weights properly when sampling weights are specified, and it also handles clustering.

Stata's mixed command for fitting multilevel linear models allows for both sampling weights and clustering. Sampling weights may be specified at all levels in your multilevel model, and thus, by necessity, weights need to be treated differently in mixed than in other estimation commands. Some caution on the part of the user is required; see the section "Survey data" in [ME] mixed for details. Also see the example of using mixed with survey data.

estat effects computes the design effects DEFF and DEFT, as well as misspecification effects MEFF and MEFT. The test command, used after a svy estimation command, computes adjusted Wald tests and Bonferroni tests for linear hypotheses (single or joint).
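The two postestimation tools just described might be used roughly as follows. This is a sketch only: it assumes the nhanes2d example dataset already carries its survey settings (as the logit example later on this page suggests), and the model and tested coefficients are picked for illustration.

```stata
* Sketch: design effects and adjusted Wald tests (illustrative only)
webuse nhanes2d, clear                      // assumed to be svyset already

svy: mean weight
estat effects, deff deft meff meft          // design and misspecification effects

svy: logit highbp height weight age female black
test female black                           // joint adjusted Wald test
test female black, mtest(bonferroni)        // Bonferroni-adjusted tests of each coefficient
```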

Here is an example of the use of the svy: mean command:

. webuse nhanes2

. svyset psu [pw=finalwgt], strata(strata)

      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

. svy: mean weight
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31        Number of obs   =      10351
Number of PSUs   =      62        Population size =  117157513
                                  Design df       =         31

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      weight |   71.90064   .1654434      71.56321    72.23806
--------------------------------------------------------------

The svyset command, illustrated above, allows you to set the variables that contain the sampling weights, strata, and any PSU identifiers at the outset. These variables are remembered for subsequent commands and do not have to be reentered.
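A minimal sketch of that persistence, reusing the nhanes2 variables from the example above (the follow-up commands are chosen only for illustration):

```stata
* Sketch: design settings are declared once and then reused (illustrative only)
webuse nhanes2, clear
svyset psu [pw=finalwgt], strata(strata)

svyset                                      // with no arguments, redisplays the current settings
svy: mean weight                            // uses the stored weights, strata, and PSUs
svy: proportion sex                         // so does every later svy command
```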

Estimating the difference between two subpopulation means can be done by running svy: mean with an over() option to produce subpopulation estimates and then running the command lincom:

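. svy: mean weight, over(sex)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31        Number of obs   =      10351
Number of PSUs   =      62        Population size =  117157513
                                  Design df       =         31

         Male: sex = Male
       Female: sex = Female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
weight       |
        Male |   78.62789   .2097761      78.20004    79.05573
      Female |   65.70701    .266384      65.16372    66.25031
--------------------------------------------------------------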
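The lincom step itself could look something like the sketch below. The coefficient names [weight]Male and [weight]Female are inferred from the labels in the output above and correspond to the Stata 13-era naming; other releases may name the estimates differently, so the sketch lists the names with coeflegend first.

```stata
* Sketch: difference between two subpopulation means (illustrative only)
webuse nhanes2, clear
svyset psu [pw=finalwgt], strata(strata)

* List the coefficient names your release assigns to the subpopulation means
svy: mean weight, over(sex) coeflegend

* Assumed names, matching the Male/Female labels shown above; adapt if yours differ
lincom [weight]Male - [weight]Female
```

lincom reports the estimated difference together with a design-based standard error, t statistic, and confidence interval.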

The svy: mean, svy: proportion, svy: ratio, and svy: total commands can produce estimates for multiple subpopulations:

. svy: mean weight, over(sex race)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31        Number of obs   =      10351
Number of PSUs   =      62        Population size =  117157513
                                  Design df       =         31

         Over: sex race
    _subpop_1: Male White
    _subpop_2: Male Black
    _subpop_3: Male Other
    _subpop_4: Female White
    _subpop_5: Female Black
    _subpop_6: Female Other

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
weight       |
   _subpop_1 |   78.98862   .2125203      78.55518    79.42206
   _subpop_2 |     78.324   .8476215      76.59526    80.05273
   _subpop_3 |   68.16404   1.811668      64.46912    71.85896
   _subpop_4 |   65.10844   .2926873       64.5115    65.70538
   _subpop_5 |   72.38252   1.059851      70.22094     74.5441
   _subpop_6 |   59.56941   1.325068      56.86692    62.27191
--------------------------------------------------------------

Use estat effects to report DEFF and DEFT.

. estat effects

         Over: sex race
    _subpop_1: Male White
    _subpop_2: Male Black
    _subpop_3: Male Other
    _subpop_4: Female White
    _subpop_5: Female Black
    _subpop_6: Female Other

----------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
weight       |
   _subpop_1 |   78.98862   .2125203     1.15287   1.07372
   _subpop_2 |     78.324   .8476215     1.34608   1.16021
   _subpop_3 |   68.16404   1.811668     2.08964   1.44556
   _subpop_4 |   65.10844   .2926873     2.09219   1.44644
   _subpop_5 |   72.38252   1.059851     1.93387   1.39064
   _subpop_6 |   59.56941   1.325068     1.55682   1.24772
----------------------------------------------------------

Use estat size to report the number of observations belonging to each subpopulation and estimates of the subpopulation size.

. estat size

         Over: sex race
    _subpop_1: Male White
    _subpop_2: Male Black
    _subpop_3: Male Other
    _subpop_4: Female White
    _subpop_5: Female Black
    _subpop_6: Female Other

----------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.          Obs          Size
-------------+--------------------------------------------------
weight       |
   _subpop_1 |   78.98862   .2125203          4312      49504800
   _subpop_2 |     78.324   .8476215           500       5096044
   _subpop_3 |   68.16404   1.811668           103       1558636
   _subpop_4 |   65.10844   .2926873          4753      53494749
   _subpop_5 |   72.38252   1.059851           586       6093192
   _subpop_6 |   59.56941   1.325068            97       1410092
----------------------------------------------------------------

You can fit linear regressions, logistic regressions, and probit models using svy estimators. Shown below is an example of svy: logit, which fits logistic regressions for survey data.

. webuse nhanes2d

. svy: logit highbp height weight age c.age#c.age female black
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =      31        Number of obs   =      10351
Number of PSUs   =      62        Population size =  117157513
                                  Design df       =         31
                                  F(   6,     26) =     231.75
                                  Prob > F        =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |  -.0345643   .0053121    -6.51   0.000    -.0453985   -.0237301
      weight |    .051004   .0025292    20.17   0.000     .0458457    .0561622
         age |   .0554544   .0127859     4.34   0.000     .0293774    .0815314
 c.age#c.age |  -.0000676   .0001385    -0.49   0.629    -.0003502    .0002149
      female |  -.4758698   .0561318    -8.48   0.000    -.5903513   -.3613882
       black |    .338201   .1075191     3.15   0.004     .1189143    .5574877
       _cons |  -.5140351   .8747001    -0.59   0.561    -2.297998    1.269928
------------------------------------------------------------------------------

svy: logit can display estimates as coefficients or as odds ratios. Below we redisplay the previous model, requesting that the estimates be expressed as odds ratios.

. svy: logit, or

Survey: Logistic regression

Number of strata =      31        Number of obs   =      10351
Number of PSUs   =      62        Population size =  117157513
                                  Design df       =         31
                                  F(   6,     26) =      87.70
                                  Prob > F        =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .9660262   .0051317    -6.51   0.000     .9556166    .9765492
      weight |   1.052327   .0026615    20.17   0.000     1.046913    1.057769
         age |   1.057021    .013515     4.34   0.000     1.029813    1.084947
 c.age#c.age |   .9999324   .0001385    -0.49   0.629     .9996499    1.000215
      female |   .6213444   .0348772    -8.48   0.000     .5541326    .6967085
       black |   1.402422   .1507872     3.15   0.004     1.126273     1.74628
       _cons |   .5980774   .5231384    -0.59   0.561     .1004598    3.560595
------------------------------------------------------------------------------
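The two displays are linked by exponentiation: each odds ratio is exp() of the corresponding coefficient. A small sketch of that relationship, reusing the model above (the display lines are an illustration, not part of the original session):

```stata
* Sketch: odds ratios are exponentiated logit coefficients (illustrative only)
webuse nhanes2d, clear
svy: logit highbp height weight age c.age#c.age female black

display exp(_b[height])                     // should agree with the height odds ratio
display exp(_b[female])                     // should agree with the female odds ratio

* svy: logistic fits the same model and reports odds ratios by default
svy: logistic highbp height weight age c.age#c.age female black
```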

After running a logistic regression, you can use lincom to compute odds ratios for any covariate group relative to another.

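. lincom female + black, or

 ( 1)  [highbp]female + [highbp]black = 0

------------------------------------------------------------------------------
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .8713873   .1233177    -0.97   0.338     .6529215    1.162951
------------------------------------------------------------------------------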

You can also fit linear regressions, logistic regressions, and probit models for a subpopulation:

. svy, subpop(black): logistic highbp age female
(running logistic on estimation sample)

Survey: Logistic regression

Number of strata =      30        Number of obs      =      10013
Number of PSUs   =      60        Population size    =  113415086
                                  Subpop. no. of obs =       1086
                                  Subpop. size       =   11189236
                                  Design df          =         30
                                  F(   2,     29)    =      83.52
                                  Prob > F           =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.060226   .0047619    13.02   0.000     1.050546    1.069996
      female |   .8280475   .1063299    -1.47   0.152     .6370331    1.076338
       _cons |   .0791591   .0185411   -10.83   0.000     .0490631    .1277163
------------------------------------------------------------------------------
Note: 1 stratum omitted because it contains no subpopulation members.
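The output above reports both the full sample (10013 observations) and the subpopulation (1086 observations), which reflects that subpop() keeps the whole design in the variance calculation while restricting the estimates. As a hedged sketch, subpop() can take an indicator variable, an if expression, or both; the particular subpopulations below are chosen only for illustration.

```stata
* Sketch: ways of specifying a subpopulation (illustrative only)
webuse nhanes2d, clear

svy, subpop(black): logistic highbp age female              // indicator variable
svy, subpop(if black == 1 & female == 1): logistic highbp age   // if expression
svy, subpop(black if female == 1): logistic highbp age      // indicator combined with if
```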

Survey data require some special data management. The svydescribe command can be used to examine the design structure of the dataset. It can also be used to see the number of missing and nonmissing observations per stratum (or optionally per stage) for one or more variables.

. svydescribe hdresult

Survey: Describing stage 1 sampling units

      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

                           #Obs with #Obs with   #Obs per included Unit
           #Units   #Units  complete   missing   ----------------------
 Stratum included  omitted      data      data     min     mean     max
-----------------------------------------------------------------------
       1       1*        1       114       266     114    114.0     114
       2       1*        1        98        87      98     98.0      98
       3       2         0       277        71     116    138.5     161
       4       2         0       340       120     160    170.0     180
       5       2         0       173        79      81     86.5      92
       6       2         0       255        43     116    127.5     139
       7       2         0       409        67     191    204.5     218
       8       2         0       299        39     129    149.5     170
       9       2         0       218        26      85    109.0     133
      10       2         0       233        29     103    116.5     130
      11       2         0       238        37      97    119.0     141
      12       2         0       275        39     121    137.5     154
      13       2         0       297        45     123    148.5     174
      14       2         0       355        50     167    177.5     188
      15       2         0       329        51     151    164.5     178
      16       2         0       280        56     134    140.0     146
      17       2         0       352        41     155    176.0     197
      18       2         0       335        24     135    167.5     200
      20       2         0       240        45      95    120.0     145
      21       2         0       198        16      91     99.0     107
      22       2         0       263        38     116    131.5     147
      23       2         0       304        37     143    152.0     161
      24       2         0       388        50     182    194.0     206
      25       2         0       239        17     106    119.5     133
      26       2         0       240        21     119    120.0     121
      27       2         0       259        24     127    129.5     132
      28       2         0       284        15     131    142.0     153
      29       2         0       440        63     193    220.0     247
      30       2         0       326        39     147    163.0     179
      31       2         0       279        29     121    139.5     158
      32       2         0       383        67     180    191.5     203
-----------------------------------------------------------------------
      31      60         2      8720      1631      81    145.3     247
                                         10351
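Strata 1 and 2 above are flagged with an asterisk because only one sampling unit remains in each once observations with missing hdresult are set aside. A short sketch of related svydescribe calls follows; it assumes, as the session above implies, that hdresult is available in nhanes2d.

```stata
* Sketch: inspecting the design structure (illustrative only)
webuse nhanes2d, clear

svydescribe                                 // design structure, no variable-specific counts
svydescribe hdresult                        // complete and missing observations per stratum
svydescribe hdresult, single                // only strata left with a single sampling unit
```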

See New in Stata 13 for more about what was added in Stata 13.
