
Survey features

Stata has a number of features designed to handle the special requirements of complex survey data. The survey features will handle probability sampling weights, multiple stages of cluster sampling, stage-level sampling weights, stratification, and poststratification.

Variance estimates are produced using one of the five variance estimation techniques: balanced repeated replication, the bootstrap, the jackknife, successive difference replication, and Taylor linearization. See [SVY] variance estimation for an overview of these techniques.
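As a minimal sketch of how the variance estimator is chosen, the commands below fit the same mean with Taylor linearization (the default) and with the jackknife. It uses the nhanes2 example dataset that appears later on this page; the choice of the jackknife here is only for illustration.

```stata
* Load the example data and declare the survey design
webuse nhanes2, clear
svyset psu [pw=finalwgt], strata(strata)

* Taylor linearization is the default variance estimator
svy: mean weight

* Request the jackknife for a single estimation command
svy jackknife: mean weight

* Or make the jackknife the default for all later svy commands
svyset psu [pw=finalwgt], strata(strata) vce(jackknife)
svy: mean weight
```

The point estimate is the same in each case; only the standard errors and the quantities derived from them depend on the variance estimator.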

Many different types of estimation can be performed using Stata's survey facilities:

Descriptive statistics

mean          Estimate means
proportion    Estimate proportions
ratio         Estimate ratios
total         Estimate totals

Linear regression models

churdle       Cragg hurdle regression
cnsreg        Constrained linear regression
etregress     Linear regression with endogenous treatment effects
glm           Generalized linear models
intreg        Interval regression
nl            Nonlinear least-squares estimation
regress       Linear regression
tobit         Tobit regression
truncreg      Truncated regression

Structural equation models

sem           Structural equation model estimation command
gsem          Generalized structural equation model estimation command

Survival-data regression models

stcox         Cox proportional hazards model
streg         Parametric survival models

Binary-response regression models

biprobit      Bivariate probit regression
cloglog       Complementary log-log regression
hetprobit     Heteroskedastic probit model
logistic      Logistic regression, reporting odds ratios
logit         Logistic regression, reporting coefficients
probit        Probit regression
scobit        Skewed logistic regression

Discrete-response regression models

clogit        Conditional (fixed-effects) logistic regression
mlogit        Multinomial (polytomous) logistic regression
mprobit       Multinomial probit regression
ologit        Ordered logistic regression
oprobit       Ordered probit regression
slogit        Stereotype logistic regression

Fractional-response regression models

betareg       Beta regression
fracreg       Fractional response regression

Poisson regression models

cpoisson      Censored Poisson regression
gnbreg        Generalized negative binomial regression in [R] nbreg
nbreg         Negative binomial regression
poisson       Poisson regression
tnbreg        Truncated negative binomial regression
tpoisson      Truncated Poisson regression
zinb          Zero-inflated negative binomial regression
zip           Zero-inflated Poisson regression

Instrumental-variables regression models

ivprobit      Probit model with continuous endogenous covariates
ivregress     Single-equation instrumental-variables regression
ivtobit       Tobit model with continuous endogenous covariates

Regression models with selection

heckman       Heckman selection model
heckoprobit   Ordered probit model with sample selection
heckprobit    Probit model with sample selection

Multilevel mixed-effects models

mecloglog     Multilevel mixed-effects complementary log-log regression
meglm         Multilevel mixed-effects generalized linear model
melogit       Multilevel mixed-effects logistic regression
menbreg       Multilevel mixed-effects negative binomial regression
meologit      Multilevel mixed-effects ordered logistic regression
meoprobit     Multilevel mixed-effects ordered probit regression
mepoisson     Multilevel mixed-effects Poisson regression
meprobit      Multilevel mixed-effects probit regression
mestreg       Multilevel mixed-effects parametric survival models

Item response theory

irt 1pl       One-parameter logistic model
irt 2pl       Two-parameter logistic model
irt 3pl       Three-parameter logistic model
irt grm       Graded response model
irt nrm       Nominal response model
irt pcm       Partial credit model
irt rsm       Rating scale model
irt hybrid    Hybrid IRT models

Many other estimation features in Stata are suitable for certain limited survey designs. For example, Stata’s competing-risks regression routine (stcrreg) properly accounts for sampling weights when they are specified, and it also handles clustering.

Stata's mixed command, which fits multilevel linear models, allows for both sampling weights and clustering. Sampling weights may be specified at every level of the multilevel model, so weights must be treated differently in mixed than in other estimation commands. Some caution on the part of the user is required; see the section Survey data in [ME] mixed for details. Also see the example of using mixed with survey data.
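To make the idea of level-specific weights concrete, here is a small sketch on synthetic data. Every variable name (score, female, school, w_student, w_school) and every generated value is hypothetical and exists only so the commands run; the weights are random numbers, not real sampling weights, so the estimates themselves mean nothing.

```stata
* Build a synthetic two-level dataset: 50 schools with 20 students each
clear
set seed 12345
set obs 50
generate school   = _n
generate w_school = 1 + runiform()     // hypothetical school-level weight
generate u        = rnormal()          // school random intercept
expand 20                              // 20 students per school
generate w_student = 1 + runiform()    // hypothetical student-level weight
generate female    = runiform() < 0.5
generate score     = 50 + 2*female + u + rnormal()

* Level-1 weights go in [pw=]; level-2 weights go in pweight() on the
* random-effects equation; pwscale(size) rescales the level-1 weights
mixed score female [pw=w_student] || school:, pweight(w_school) pwscale(size)
```

With real data, the level-1 weight should reflect selection of the student given that the school was selected, which is one reason the section Survey data in [ME] mixed is worth reading before relying on the results.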

estat effects computes the design effects DEFF and DEFT, as well as misspecification effects MEFF and MEFT. test, used after svy, computes adjusted Wald tests and Bonferroni tests for linear hypotheses (single or joint).
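A short sketch of these postestimation tools, assuming the nhanes2d example dataset (which already carries its survey settings); the model is chosen only for illustration:

```stata
* Fit a survey logistic regression on the example data
webuse nhanes2d, clear
svy: logit highbp height weight age female black

* Adjusted Wald test of a joint hypothesis (the default after svy)
test female black

* The same hypothesis without the survey adjustment, for comparison
test female black, nosvyadjust

* Bonferroni-adjusted tests of each coefficient separately
test female black, mtest(bonferroni)

* Design effects and misspecification effects for the coefficients
estat effects, deff deft meff meft
```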

Here is an example of the use of svy: mean:

. webuse nhanes2

. svyset psu [pw=finalwgt], strata(strata)

      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

. svy: mean weight
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31      Number of obs   =      10,351
Number of PSUs   =      62      Population size = 117,157,513
                                Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      weight |   71.90064   .1654434      71.56321    72.23806
--------------------------------------------------------------

svyset, illustrated above, allows you to set the variables that contain the sampling weights, strata, and any PSU identifiers at the outset. These variables are remembered for subsequent commands and do not have to be reentered.
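A brief sketch of how the stored settings behave. The file name mysurvey is hypothetical; everything else uses the nhanes2 example data.

```stata
* Declare the design once
webuse nhanes2, clear
svyset psu [pw=finalwgt], strata(strata)

* svyset with no arguments redisplays the current settings
svyset

* The settings are stored with the data, so they survive save and use
save mysurvey, replace
use mysurvey, clear
svy: mean weight

* Remove the settings if the design needs to be declared again
svyset, clear
```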

Estimating the difference between two subpopulation means can be done by running svy: mean with the over() option to produce subpopulation estimates and then running lincom on those estimates:

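. svy: mean weight, over(sex)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31      Number of obs   =      10,351
Number of PSUs   =      62      Population size = 117,157,513
                                Design df       =          31

         Male: sex = Male
       Female: sex = Female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
weight       |
        Male |   78.62789   .2097761      78.20004    79.05573
      Female |   65.70701    .266384      65.16372    66.25031
--------------------------------------------------------------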
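The lincom step can be sketched as follows. The coefficient names Male and Female match the subpopulation labels in the output shown here; this is an assumption about the Stata release in use, because later releases label over() estimates differently. Adding the coeflegend option to svy: mean shows the names your release expects.

```stata
* Subpopulation means of weight by sex
webuse nhanes2, clear
svyset psu [pw=finalwgt], strata(strata)
svy: mean weight, over(sex)

* Show the coefficient names to use in expressions
svy: mean weight, over(sex) coeflegend

* Difference between the male and female means, with a
* design-based standard error and confidence interval
lincom [weight]Male - [weight]Female
```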

svy: mean, svy: proportion, svy: ratio, and svy: total produce estimates for multiple subpopulations:

. svy: mean weight, over(sex race)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31      Number of obs   =      10,351
Number of PSUs   =      62      Population size = 117,157,513
                                Design df       =          31

         Over: sex race
    _subpop_1: Male White
    _subpop_2: Male Black
    _subpop_3: Male Other
    _subpop_4: Female White
    _subpop_5: Female Black
    _subpop_6: Female Other

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
weight       |
   _subpop_1 |   78.98862   .2125203      78.55518    79.42206
   _subpop_2 |     78.324   .8476215      76.59526    80.05273
   _subpop_3 |   68.16404   1.811668      64.46912    71.85896
   _subpop_4 |   65.10844   .2926873       64.5115    65.70538
   _subpop_5 |   72.38252   1.059851      70.22094     74.5441
   _subpop_6 |   59.56941   1.325068      56.86692    62.27191
--------------------------------------------------------------

Use estat effects to report DEFF and DEFT.

. estat effects

         Over: sex race
    _subpop_1: Male White
    _subpop_2: Male Black
    _subpop_3: Male Other
    _subpop_4: Female White
    _subpop_5: Female Black
    _subpop_6: Female Other

----------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
weight       |
   _subpop_1 |   78.98862   .2125203     1.15287   1.07372
   _subpop_2 |     78.324   .8476215     1.34608   1.16021
   _subpop_3 |   68.16404   1.811668     2.08964   1.44556
   _subpop_4 |   65.10844   .2926873     2.09219   1.44644
   _subpop_5 |   72.38252   1.059851     1.93387   1.39064
   _subpop_6 |   59.56941   1.325068     1.55682   1.24772
----------------------------------------------------------

Use estat size to report the number of observations belonging to each subpopulation and estimates of the subpopulation size.

. estat size

         Over: sex race
    _subpop_1: Male White
    _subpop_2: Male Black
    _subpop_3: Male Other
    _subpop_4: Female White
    _subpop_5: Female Black
    _subpop_6: Female Other

----------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.         Obs          Size
-------------+--------------------------------------------------
weight       |
   _subpop_1 |   78.98862   .2125203       4,312    49,504,800
   _subpop_2 |     78.324   .8476215         500     5,096,044
   _subpop_3 |   68.16404   1.811668         103     1,558,636
   _subpop_4 |   65.10844   .2926873       4,753    53,494,749
   _subpop_5 |   72.38252   1.059851         586     6,093,192
   _subpop_6 |   59.56941   1.325068          97     1,410,092
----------------------------------------------------------------

You can fit linear regressions, logistic regressions, and probit models using svy estimators. Shown below is an example of svy: logit, which fits logistic regressions for survey data.

. webuse nhanes2d

. svy: logit highbp height weight age c.age#c.age female black
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =        31      Number of obs     =      10,351
Number of PSUs     =        62      Population size   = 117,157,513
                                    Design df         =          31
                                    F(   6,     26)   =      231.75
                                    Prob > F          =      0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |  -.0345643   .0053121    -6.51   0.000    -.0453985   -.0237301
      weight |    .051004   .0025292    20.17   0.000     .0458457    .0561622
         age |   .0554544   .0127859     4.34   0.000     .0293774    .0815314
 c.age#c.age |  -.0000676   .0001385    -0.49   0.629    -.0003502    .0002149
      female |  -.4758698   .0561318    -8.48   0.000    -.5903513   -.3613882
       black |    .338201   .1075191     3.15   0.004     .1189143    .5574877
       _cons |  -.5140351   .8747001    -0.59   0.561    -2.297998    1.269928
------------------------------------------------------------------------------

svy: logit can display estimates as coefficients or as odds ratios. Below we redisplay the previous model, requesting that the estimates be expressed as odds ratios.

. svy: logit, or

Survey: Logistic regression

Number of strata   =        31      Number of obs     =      10,351
Number of PSUs     =        62      Population size   = 117,157,513
                                    Design df         =          31
                                    F(   6,     26)   =      231.75
                                    Prob > F          =      0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .9660262   .0051317    -6.51   0.000     .9556166    .9765492
      weight |   1.052327   .0026615    20.17   0.000     1.046913    1.057769
         age |   1.057021    .013515     4.34   0.000     1.029813    1.084947
 c.age#c.age |   .9999324   .0001385    -0.49   0.629     .9996499    1.000215
      female |   .6213444   .0348772    -8.48   0.000     .5541326    .6967085
       black |   1.402422   .1507872     3.15   0.004     1.126273     1.74628
       _cons |   .5980774   .5231384    -0.59   0.561     .1004598    3.560595
------------------------------------------------------------------------------
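The two displays describe the same fitted model: each odds ratio is the exponentiated coefficient, and its standard error follows from the delta method (odds ratio times the coefficient's standard error). A minimal sketch that checks this by hand for the female covariate, refitting the model so that it stands alone:

```stata
* Refit the model so that _b[] and _se[] are available
webuse nhanes2d, clear
svy: logit highbp height weight age c.age#c.age female black

* Odds ratio for female: exponentiate the coefficient
display exp(_b[female])

* Delta-method standard error of that odds ratio
display exp(_b[female]) * _se[female]
```

The two displayed values should agree with the Odds Ratio and Std. Err. columns that svy: logit, or reports for female. The t statistics and p-values are unchanged because they are computed on the coefficient scale.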

After running a logistic regression, you can use lincom to compute odds ratios for any covariate group relative to another.

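. lincom female + black, or

 ( 1)  [highbp]female + [highbp]black = 0

------------------------------------------------------------------------------
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .8713873   .1233177    -0.97   0.338     .6529215    1.162951
------------------------------------------------------------------------------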

You can also fit linear regressions, logistic regressions, and probit models for a subpopulation:

. svy, subpop(black): logistic highbp age female
(running logistic on estimation sample)

Survey: Logistic regression

Number of strata   =        30      Number of obs     =      10,013
Number of PSUs     =        60      Population size   = 113,415,086
                                    Subpop. no. obs   =       1,086
                                    Subpop. size      =  11,189,236
                                    Design df         =          30
                                    F(   2,     29)   =       83.52
                                    Prob > F          =      0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.060226   .0047619    13.02   0.000     1.050546    1.069996
      female |   .8280475   .1063299    -1.47   0.152     .6370331    1.076338
       _cons |   .0791591   .0185411   -10.83   0.000     .0490631    .1277163
------------------------------------------------------------------------------
Note: 1 stratum omitted because it contains no subpopulation members.

Survey data require some special data management. svydescribe can be used to examine the design structure of the dataset. It can also be used to see the number of missing and nonmissing observations per stratum (or optionally per stage) for one or more variables.

. svydescribe hdresult

Survey: Describing stage 1 sampling units

      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

                             #Obs with #Obs with  #Obs per included Unit
            #Units    #Units  complete   missing  ----------------------
 Stratum  included   omitted      data      data     min    mean     max
--------  --------  --------  --------  --------  ------  ------  ------
       1        1*         1       114       266     114   114.0     114
       2        1*         1        98        87      98    98.0      98
       3        2          0       277        71     116   138.5     161
       4        2          0       340       120     160   170.0     180
       5        2          0       173        79      81    86.5      92
       6        2          0       255        43     116   127.5     139
       7        2          0       409        67     191   204.5     218
       8        2          0       299        39     129   149.5     170
       9        2          0       218        26      85   109.0     133
      10        2          0       233        29     103   116.5     130
      11        2          0       238        37      97   119.0     141
      12        2          0       275        39     121   137.5     154
      13        2          0       297        45     123   148.5     174
      14        2          0       355        50     167   177.5     188
      15        2          0       329        51     151   164.5     178
      16        2          0       280        56     134   140.0     146
      17        2          0       352        41     155   176.0     197
      18        2          0       335        24     135   167.5     200
      20        2          0       240        45      95   120.0     145
      21        2          0       198        16      91    99.0     107
      22        2          0       263        38     116   131.5     147
      23        2          0       304        37     143   152.0     161
      24        2          0       388        50     182   194.0     206
      25        2          0       239        17     106   119.5     133
      26        2          0       240        21     119   120.0     121
      27        2          0       259        24     127   129.5     132
      28        2          0       284        15     131   142.0     153
      29        2          0       440        63     193   220.0     247
      30        2          0       326        39     147   163.0     179
      31        2          0       279        29     121   139.5     158
      32        2          0       383        67     180   191.5     203
--------  --------  --------  --------  --------  ------  ------  ------
      31       60          2      8720      1631      81   145.3     247
                              ------------------
                                           10351

See New in Stata 14 for more about what was added in Stata 14.
