Stata has a number of features designed to handle the special requirements of complex survey data. The survey features will handle probability sampling weights, multiple stages of cluster sampling, stage-level sampling weights, stratification, and poststratification.

Variance estimates are produced using one of five variance estimation techniques: balanced repeated replication, the bootstrap, the jackknife, successive difference replication, or Taylor linearization. See **[SVY]** variance estimation for an overview of these techniques.
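As a rough sketch of how a technique is selected, the commands below estimate the same mean under Taylor linearization and under the jackknife. They assume the nhanes2 example dataset and the design variables (psu, strata, finalwgt) used in the examples on this page. The replicate-weight-based methods are left out because they need additional setup that this sketch does not include.

```stata
* Sketch only: assumes the nhanes2 example data used elsewhere on this page
webuse nhanes2, clear

* Declare the design; Taylor linearization is the default variance estimator
svyset psu [pweight=finalwgt], strata(strata)
svy: mean weight

* Request jackknife variance estimates for a single command
svy jackknife: mean weight

* Or make the jackknife the default for later svy commands
svyset psu [pweight=finalwgt], strata(strata) vce(jackknife)
svy: mean weight
```

The point estimate is the same in each run; only the standard errors and the VCE label in the header change.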

Many different types of estimation can be performed using Stata's survey facilities:

**Descriptive statistics**

Command | Description |
---|---|
mean | Estimate means |
proportion | Estimate proportions |
ratio | Estimate ratios |
total | Estimate totals |
**Linear regression models**

Command | Description |
---|---|
churdle | Cragg hurdle regression |
cnsreg | Constrained linear regression |
etregress | Linear regression with endogenous treatment effects |
glm | Generalized linear models |
intreg | Interval regression |
nl | Nonlinear least-squares estimation |
regress | Linear regression |
tobit | Tobit regression |
truncreg | Truncated regression |

**Structural equation models**

Command | Description |
---|---|
sem | Structural equation model estimation command |
gsem | Generalized structural equation model estimation command |

**Survival-data regression models**

Command | Description |
---|---|
stcox | Cox proportional hazards model |
streg | Parametric survival models |

**Binary-response regression models**

Command | Description |
---|---|
biprobit | Bivariate probit regression |
cloglog | Complementary log-log regression |
hetprobit | Heteroskedastic probit model |
logistic | Logistic regression, reporting odds ratios |
logit | Logistic regression, reporting coefficients |
probit | Probit regression |
scobit | Skewed logistic regression |

**Discrete-response regression models**

Command | Description |
---|---|
clogit | Conditional (fixed-effects) logistic regression |
mlogit | Multinomial (polytomous) logistic regression |
mprobit | Multinomial probit regression |
ologit | Ordered logistic regression |
oprobit | Ordered probit regression |
slogit | Stereotype logistic regression |

**Fractional-response regression models**

Command | Description |
---|---|
betareg | Beta regression |
fracreg | Fractional response regression |

**Poisson regression models**

Command | Description |
---|---|
cpoisson | Censored Poisson regression |
gnbreg | Generalized negative binomial regression in [R] nbreg |
nbreg | Negative binomial regression |
poisson | Poisson regression |
tnbreg | Truncated negative binomial regression |
tpoisson | Truncated Poisson regression |
zinb | Zero-inflated negative binomial regression |
zip | Zero-inflated Poisson regression |

**Instrumental-variables regression models**

Command | Description |
---|---|
ivprobit | Probit model with continuous endogenous covariates |
ivregress | Single-equation instrumental-variables regression |
ivtobit | Tobit model with continuous endogenous covariates |

**Regression models with selection**

Command | Description |
---|---|
heckman | Heckman selection model |
heckoprobit | Ordered probit model with sample selection |
heckprobit | Probit model with sample selection |

**Multilevel mixed-effects models**

Command | Description |
---|---|
mecloglog | Multilevel mixed-effects complementary log-log regression |
meglm | Multilevel mixed-effects generalized linear model |
melogit | Multilevel mixed-effects logistic regression |
menbreg | Multilevel mixed-effects negative binomial regression |
meologit | Multilevel mixed-effects ordered logistic regression |
meoprobit | Multilevel mixed-effects ordered probit regression |
mepoisson | Multilevel mixed-effects Poisson regression |
meprobit | Multilevel mixed-effects probit regression |
mestreg | Multilevel mixed-effects parametric survival models |

**Item response theory**

Command | Description |
---|---|
irt 1pl | One-parameter logistic model |
irt 2pl | Two-parameter logistic model |
irt 3pl | Three-parameter logistic model |
irt grm | Graded response model |
irt nrm | Nominal response model |
irt pcm | Partial credit model |
irt rsm | Rating scale model |
irt hybrid | Hybrid IRT models |

Many other estimation features in Stata are suitable for certain limited survey designs. For example, Stata’s competing-risks regression routine (stcrreg) handles sampling weights properly when sampling weights are specified, and it also handles clustering.

Stata's mixed command for fitting multilevel linear models allows for both sampling weights and clustering. Sampling weights may be specified at all levels of your multilevel model, so weights are necessarily treated differently in mixed than in other estimation commands. Some caution on the part of the user is required; see the documentation for mixed for details.

estat effects computes the design effects DEFF and DEFT, as well as the misspecification effects MEFF and MEFT. test, used after svy, computes adjusted Wald tests and Bonferroni tests for linear hypotheses (single or joint).
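A minimal sketch of the postestimation commands estat effects and test, assuming the nhanes2d example dataset (which this page uses without a separate svyset step) and the logistic model fit later on this page:

```stata
* Sketch only: nhanes2d is assumed to carry its survey design settings
webuse nhanes2d, clear
svy: logit highbp height weight age c.age#c.age female black

* Adjusted Wald test of the joint hypothesis that both coefficients are zero
test female black

* The same hypotheses tested individually, with Bonferroni-adjusted p-values
test female black, mtest(bonferroni)

* Design effects (DEFF and DEFT) for the fitted coefficients
estat effects
```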

Here is an example of the use of svy: mean:

    . webuse nhanes2

    . svyset psu [pw=finalwgt], strata(strata)

          pweight: finalwgt
              VCE: linearized
      Single unit: missing
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>

    . svy: mean weight
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =      31       Number of obs   =      10,351
    Number of PSUs   =      62       Population size = 117,157,513
                                     Design df       =          31

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
          weight |   71.90064   .1654434      71.56321    72.23806
    --------------------------------------------------------------

svyset, illustrated above, allows you to set the variables that contain the sampling weights, strata, and any PSU identifiers at the outset. These variables are remembered for subsequent commands and do not have to be reentered.
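A brief sketch of how the remembered settings can be inspected, kept with the data, and removed. It assumes the nhanes2 example dataset; the file name nhanes2_svy is hypothetical and chosen only for illustration.

```stata
* Sketch only: assumes the nhanes2 example data used above
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)

* With no arguments, svyset redisplays the current design settings
svyset

* The settings are stored with the dataset when it is saved
save nhanes2_svy, replace   // hypothetical file name

* Remove the design settings from the data in memory
svyset, clear
```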

Estimating the difference between two subpopulation means can be done by running svy: mean with an over() option to produce subpopulation estimates and then running lincom:

    . svy: mean weight, over(sex)
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =      31       Number of obs   =      10,351
    Number of PSUs   =      62       Population size = 117,157,513
                                     Design df       =          31

             Male: sex = Male
           Female: sex = Female

    --------------------------------------------------------------
                 |             Linearized
            Over |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
    weight       |
            Male |   78.62789   .2097761      78.20004    79.05573
          Female |   65.70701    .266384      65.16372    66.25031
    --------------------------------------------------------------
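The lincom step can be sketched as follows. The coefficient names [weight]Male and [weight]Female are an assumption based on the labels in the Stata 14 output above. Other releases may name the subpopulation estimates differently, so the sketch first lists e(b) to show the names actually stored.

```stata
* Sketch only: assumes the nhanes2 example data and design used above
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)
svy: mean weight, over(sex)

* List the stored estimates to confirm how the subpopulation means are named
matrix list e(b)

* Difference between the male and female mean weights
* (names assumed from the Stata 14 labels; adjust to match e(b) if they differ)
lincom [weight]Male - [weight]Female
```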

svy: mean, svy: prop, svy: ratio, and svy: total produce estimates for multiple subpopulations:

    . svy: mean weight, over(sex race)
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =      31       Number of obs   =      10,351
    Number of PSUs   =      62       Population size = 117,157,513
                                     Design df       =          31

             Over: sex race
        _subpop_1: Male White
        _subpop_2: Male Black
        _subpop_3: Male Other
        _subpop_4: Female White
        _subpop_5: Female Black
        _subpop_6: Female Other

    --------------------------------------------------------------
                 |             Linearized
            Over |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
    weight       |
       _subpop_1 |   78.98862   .2125203      78.55518    79.42206
       _subpop_2 |     78.324   .8476215      76.59526    80.05273
       _subpop_3 |   68.16404   1.811668      64.46912    71.85896
       _subpop_4 |   65.10844   .2926873       64.5115    65.70538
       _subpop_5 |   72.38252   1.059851      70.22094     74.5441
       _subpop_6 |   59.56941   1.325068      56.86692    62.27191
    --------------------------------------------------------------

Use estat effects to report DEFF and DEFT.

    . estat effects

             Over: sex race
        _subpop_1: Male White
        _subpop_2: Male Black
        _subpop_3: Male Other
        _subpop_4: Female White
        _subpop_5: Female Black
        _subpop_6: Female Other

    ----------------------------------------------------------
                 |             Linearized
            Over |       Mean   Std. Err.       DEFF      DEFT
    -------------+--------------------------------------------
    weight       |
       _subpop_1 |   78.98862   .2125203     1.15287   1.07372
       _subpop_2 |     78.324   .8476215     1.34608   1.16021
       _subpop_3 |   68.16404   1.811668     2.08964   1.44556
       _subpop_4 |   65.10844   .2926873     2.09219   1.44644
       _subpop_5 |   72.38252   1.059851     1.93387   1.39064
       _subpop_6 |   59.56941   1.325068     1.55682   1.24772
    ----------------------------------------------------------

Use estat size to report the number of observations belonging to each subpopulation and estimates of the subpopulation size.

    . estat size

             Over: sex race
        _subpop_1: Male White
        _subpop_2: Male Black
        _subpop_3: Male Other
        _subpop_4: Female White
        _subpop_5: Female Black
        _subpop_6: Female Other

    ---------------------------------------------------------------
                 |             Linearized
            Over |       Mean   Std. Err.         Obs          Size
    -------------+-------------------------------------------------
    weight       |
       _subpop_1 |   78.98862   .2125203        4,312    49,504,800
       _subpop_2 |     78.324   .8476215          500     5,096,044
       _subpop_3 |   68.16404   1.811668          103     1,558,636
       _subpop_4 |   65.10844   .2926873        4,753    53,494,749
       _subpop_5 |   72.38252   1.059851          586     6,093,192
       _subpop_6 |   59.56941   1.325068           97     1,410,092
    ---------------------------------------------------------------

You can fit linear regressions, logistic regressions, and probit models using svy estimators. Shown below is an example of svy: logit, which fits logistic regressions for survey data.

    . webuse nhanes2d

    . svy: logit highbp height weight age c.age#c.age female black
    (running logit on estimation sample)

    Survey: Logistic regression

    Number of strata   =        31                Number of obs     =       10,351
    Number of PSUs     =        62                Population size   =  117,157,513
                                                  Design df         =           31
                                                  F(   6,     26)   =       231.75
                                                  Prob > F          =       0.0000

    ------------------------------------------------------------------------------
                 |             Linearized
          highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          height |  -.0345643   .0053121    -6.51   0.000    -.0453985   -.0237301
          weight |    .051004   .0025292    20.17   0.000     .0458457    .0561622
             age |   .0554544   .0127859     4.34   0.000     .0293774    .0815314
     c.age#c.age |  -.0000676   .0001385    -0.49   0.629    -.0003502    .0002149
          female |  -.4758698   .0561318    -8.48   0.000    -.5903513   -.3613882
           black |    .338201   .1075191     3.15   0.004     .1189143    .5574877
           _cons |  -.5140351   .8747001    -0.59   0.561    -2.297998    1.269928
    ------------------------------------------------------------------------------

svy: logit can display estimates as coefficients or as odds ratios. Below we redisplay the previous model, requesting that the estimates be expressed as odds ratios.

    . svy: logit, or

    Survey: Logistic regression

    Number of strata   =        31                Number of obs     =       10,351
    Number of PSUs     =        62                Population size   =  117,157,513
                                                  Design df         =           31
                                                  F(   6,     26)   =       231.75
                                                  Prob > F          =       0.0000

    ------------------------------------------------------------------------------
                 |             Linearized
          highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          height |   .9660262   .0051317    -6.51   0.000     .9556166    .9765492
          weight |   1.052327   .0026615    20.17   0.000     1.046913    1.057769
             age |   1.057021    .013515     4.34   0.000     1.029813    1.084947
     c.age#c.age |   .9999324   .0001385    -0.49   0.629     .9996499    1.000215
          female |   .6213444   .0348772    -8.48   0.000     .5541326    .6967085
           black |   1.402422   .1507872     3.15   0.004     1.126273     1.74628
           _cons |   .5980774   .5231384    -0.59   0.561     .1004598    3.560595
    ------------------------------------------------------------------------------

After running a logistic regression, you can use lincom to compute odds ratios for any covariate group relative to another.

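    . lincom female + black, or

     ( 1)  [highbp]female + [highbp]black = 0

    ------------------------------------------------------------------------------
          highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             (1) |   .8713873   .1233177    -0.97   0.338     .6529215    1.162951
    ------------------------------------------------------------------------------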

You can also fit linear regressions, logistic regressions, and probit models for a subpopulation:

    . svy, subpop(black): logistic highbp age female
    (running logistic on estimation sample)

    Survey: Logistic regression

    Number of strata   =        30                Number of obs     =       10,013
    Number of PSUs     =        60                Population size   =  113,415,086
                                                  Subpop. no. obs   =        1,086
                                                  Subpop. size      =   11,189,236
                                                  Design df         =           30
                                                  F(   2,     29)   =        83.52
                                                  Prob > F          =       0.0000

    ------------------------------------------------------------------------------
                 |             Linearized
          highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   1.060226   .0047619    13.02   0.000     1.050546    1.069996
          female |   .8280475   .1063299    -1.47   0.152     .6370331    1.076338
           _cons |   .0791591   .0185411   -10.83   0.000     .0490631    .1277163
    ------------------------------------------------------------------------------
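A short sketch of two ways to define a subpopulation, assuming the nhanes2d example dataset loaded earlier. subpop() accepts an indicator variable, an if expression, or both. The second command is an illustration added here and is not part of the original example.

```stata
* Sketch only: assumes the nhanes2d example data used above
webuse nhanes2d, clear

* Subpopulation identified by an indicator variable, as in the example above
svy, subpop(black): logistic highbp age female

* Subpopulation identified by an expression: mean weight among black females
svy, subpop(if black == 1 & female == 1): mean weight
```

Using subpop() rather than dropping observations keeps the full design information available for variance estimation.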

Survey data require some special data management. svydescribe can be used to examine the design structure of the dataset. It can also be used to see the number of missing and nonmissing observations per stratum (or optionally per stage) for one or more variables.

    . svydescribe hdresult

    Survey: Describing stage 1 sampling units

          pweight: finalwgt
              VCE: linearized
      Single unit: missing
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>

                                  #Obs with #Obs with      #Obs per included Unit
                #Units    #Units   complete   missing   -------------------------
     Stratum  included   omitted       data      data      min      mean      max
    --------  --------  --------   --------  --------  -------  --------  -------
           1         1*        1        114       266      114     114.0      114
           2         1*        1         98        87       98      98.0       98
           3         2         0        277        71      116     138.5      161
           4         2         0        340       120      160     170.0      180
           5         2         0        173        79       81      86.5       92
           6         2         0        255        43      116     127.5      139
           7         2         0        409        67      191     204.5      218
           8         2         0        299        39      129     149.5      170
           9         2         0        218        26       85     109.0      133
          10         2         0        233        29      103     116.5      130
          11         2         0        238        37       97     119.0      141
          12         2         0        275        39      121     137.5      154
          13         2         0        297        45      123     148.5      174
          14         2         0        355        50      167     177.5      188
          15         2         0        329        51      151     164.5      178
          16         2         0        280        56      134     140.0      146
          17         2         0        352        41      155     176.0      197
          18         2         0        335        24      135     167.5      200
          20         2         0        240        45       95     120.0      145
          21         2         0        198        16       91      99.0      107
          22         2         0        263        38      116     131.5      147
          23         2         0        304        37      143     152.0      161
          24         2         0        388        50      182     194.0      206
          25         2         0        239        17      106     119.5      133
          26         2         0        240        21      119     120.0      121
          27         2         0        259        24      127     129.5      132
          28         2         0        284        15      131     142.0      153
          29         2         0        440        63      193     220.0      247
          30         2         0        326        39      147     163.0      179
          31         2         0        279        29      121     139.5      158
          32         2         0        383        67      180     191.5      203
    --------  --------  --------   --------  --------  -------  --------  -------
          31        60         2       8720      1631       81     145.3      247
                                             --------
                                                10351

See **New in Stata 14** for more about what was added in Stata 14.