Survey features

Stata has a number of features designed to handle the special requirements of complex survey data. The survey features will handle probability sampling weights, multiple stages of cluster sampling, stage-level sampling weights, stratification, and poststratification.

Variance estimates are produced using one of five variance estimation techniques: balanced repeated replication, the bootstrap, the jackknife, successive difference replication, or Taylor linearization. See [SVY] variance estimation for an overview of these techniques.

Many different types of estimation can be performed using Stata's survey facilities:

Descriptive statistics

mean                Estimate means
proportion          Estimate proportions
ratio               Estimate ratios
tabulate (oneway)   One-way tables for survey data
tabulate (twoway)   Two-way tables for survey data
total               Estimate totals

Linear regression models

churdle             Cragg hurdle regression
cnsreg              Constrained linear regression
eintreg             Extended interval regression
eregress            Extended linear regression
etregress           Linear regression with endogenous treatment effects
glm                 Generalized linear models
hetregress          Heteroskedastic linear regression
intreg              Interval regression
nl                  Nonlinear least-squares estimation
regress             Linear regression
tobit               Tobit regression
truncreg            Truncated regression

Structural equation models

sem                 Structural equation model estimation command
gsem                Generalized structural equation model estimation command

Survival-data regression models

stcox               Cox proportional hazards model
stintreg            Parametric models for interval-censored survival-time data
streg               Parametric survival models

Binary-response regression models

biprobit            Bivariate probit regression
cloglog             Complementary log-log regression
eprobit             Extended probit regression
hetprobit           Heteroskedastic probit model
logistic            Logistic regression, reporting odds ratios
logit               Logistic regression, reporting coefficients
probit              Probit regression
scobit              Skewed logistic regression

Discrete-response regression models

clogit              Conditional (fixed-effects) logistic regression
cmmixlogit          Mixed logit choice model
cmxtmixlogit        Panel-data mixed logit choice model
eoprobit            Extended ordered probit regression
hetoprobit          Heteroskedastic ordered probit regression
mlogit              Multinomial (polytomous) logistic regression
mprobit             Multinomial probit regression
ologit              Ordered logistic regression
oprobit             Ordered probit regression
slogit              Stereotype logistic regression
ziologit            Zero-inflated ordered logit regression
zioprobit           Zero-inflated ordered probit regression

Fractional-response regression models

betareg             Beta regression
fracreg             Fractional response regression

Poisson regression models

cpoisson            Censored Poisson regression
etpoisson           Poisson regression with endogenous treatment effects
gnbreg              Generalized negative binomial regression in [R] nbreg
nbreg               Negative binomial regression
poisson             Poisson regression
tnbreg              Truncated negative binomial regression
tpoisson            Truncated Poisson regression
zinb                Zero-inflated negative binomial regression
zip                 Zero-inflated Poisson regression

Instrumental-variables regression models

ivprobit            Probit model with continuous endogenous covariates
ivregress           Single-equation instrumental-variables regression
ivtobit             Tobit model with continuous endogenous covariates

Regression models with selection

heckman             Heckman selection model
heckoprobit         Ordered probit model with sample selection
heckpoisson         Poisson regression with sample selection
heckprobit          Probit model with sample selection

Longitudinal/panel-data regression models

xtmlogit            Fixed-effects and random-effects multinomial logit models

Multilevel mixed-effects models

mecloglog           Multilevel mixed-effects complementary log-log regression
meglm               Multilevel mixed-effects generalized linear model
meintreg            Multilevel mixed-effects interval regression
melogit             Multilevel mixed-effects logistic regression
menbreg             Multilevel mixed-effects negative binomial regression
meologit            Multilevel mixed-effects ordered logistic regression
meoprobit           Multilevel mixed-effects ordered probit regression
mepoisson           Multilevel mixed-effects Poisson regression
meprobit            Multilevel mixed-effects probit regression
mestreg             Multilevel mixed-effects parametric survival models
metobit             Multilevel mixed-effects tobit regression

Finite mixture models

fmm: betareg Finite mixtures of beta regression models
fmm: cloglog Finite mixtures of complementary log-log regression models
fmm: glm Finite mixtures of generalized linear regression models
fmm: intreg Finite mixtures of interval regression models
fmm: ivregress Finite mixtures of linear regression models with endogenous covariates
fmm: logit Finite mixtures of logistic regression models
fmm: mlogit Finite mixtures of multinomial (polytomous) logistic regression models
fmm: nbreg Finite mixtures of negative binomial regression models
fmm: ologit Finite mixtures of ordered logistic regression models
fmm: oprobit Finite mixtures of ordered probit regression models
fmm: pointmass Finite mixtures models with a density mass at a single point
fmm: poisson Finite mixtures of Poisson regression models
fmm: probit Finite mixtures of probit regression models
fmm: regress Finite mixtures of linear regression models
fmm: streg Finite mixtures of parametric survival models
fmm: tobit Finite mixtures of tobit regression models
fmm: tpoisson Finite mixtures of truncated Poisson regression models
fmm: truncreg Finite mixtures of truncated linear regression models

Item response theory

irt 1pl             One-parameter logistic model
irt 2pl             Two-parameter logistic model
irt 3pl             Three-parameter logistic model
irt grm             Graded response model
irt nrm             Nominal response model
irt pcm             Partial credit model
irt rsm             Rating scale model
irt hybrid          Hybrid IRT models

Many other estimation features in Stata are suitable for certain limited survey designs. For example, Stata’s competing-risks regression routine (stcrreg) handles sampling weights properly when they are specified, and it also handles clustering.

Stata's mixed command for fitting multilevel linear models allows both sampling weights and clustering. Sampling weights may be specified at all levels of a multilevel model, so mixed necessarily treats weights differently from other estimation commands. Some caution on the part of the user is required; see the section Survey data in [ME] mixed for details. Also see the example of using mixed with survey data.

estat effects computes the design effects DEFF and DEFT, as well as misspecification effects MEFF and MEFT. test, used after svy, computes adjusted Wald tests and Bonferroni tests for linear hypotheses (single or joint).
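As a rough sketch of how these postestimation commands might be combined, the do-file fragment below fits a small survey logit model and then requests the tests and effects described above. It assumes that the nhanes2d dataset (used later on this page) loads with its survey design already declared, and the choice of covariates is purely illustrative.

```stata
* Load the NHANES II extract; assumed to arrive with svyset information.
webuse nhanes2d, clear

* Fit an illustrative survey logit model (covariate choice is arbitrary).
quietly svy: logit highbp height weight age female black

* Adjusted Wald test that the female and black coefficients are jointly zero.
test female black

* The same hypotheses tested one at a time with Bonferroni-adjusted p-values.
test female black, mtest(bonferroni)

* Design effects and misspecification effects for each coefficient.
estat effects, deff deft meff meft
```

DEFT is the square root of DEFF, so the two columns carry the same information on different scales.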

Here is an example of the use of svy: mean:

. webuse nhanes2

. svyset psu [pw=finalwgt], strata(strata)

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: 

. svy: mean weight
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,351
Number of PSUs   = 62            Population size = 117,157,513
                                 Design df       =          31


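--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
      weight |   71.90064   .1654434      71.56321    72.23806
--------------------------------------------------------------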

svyset, illustrated above, allows you to set the variables that contain the sampling weights, strata, and any PSU identifiers at the outset. These variables are remembered for subsequent commands and do not have to be reentered.
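Because the design is stored once by svyset, switching among the variance estimators listed at the top of this page can be done per command. The fragment below is a minimal sketch of that idea: it compares the default linearized standard error with a jackknife one by naming the estimator in the svy prefix. The jackknife is used here only because it needs no replicate-weight variables; the other replication methods generally do.

```stata
* Declare the design once; linearized variance estimation is the default.
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)

* Taylor-linearized standard error (uses the stored settings as is).
svy: mean weight

* Jackknife standard error for this command only; svyset is left unchanged.
svy jackknife: mean weight
```

The point estimate is the same in both runs; only the standard error and the quantities derived from it can differ.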

Estimating the difference between two subpopulation means can be done by running svy: mean with an over() option to produce subpopulation estimates and then running lincom:

. svy: mean weight, over(sex)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,351
Number of PSUs   = 62            Population size = 117,157,513
                                 Design df       =          31

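--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.weight@sex |
       Male  |   78.62789   .2097761      78.20004    79.05573
     Female  |   65.70701    .266384      65.16372    66.25031
--------------------------------------------------------------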
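The lincom step mentioned above is not shown in the output, so here is a sketch of what it could look like. The coefficient names are an assumption: they follow the factor-variable notation suggested by the c.weight@sex label in the table and assume that sex is coded 1 for Male and 2 for Female in this dataset.

```stata
* Re-create the subpopulation means so the fragment is self-contained.
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)
quietly svy: mean weight, over(sex)

* Difference in mean weight, Male minus Female, with a design-based
* standard error and confidence interval.
* Assumed coefficient names: c.weight@1.sex (Male), c.weight@2.sex (Female).
lincom _b[c.weight@1.sex] - _b[c.weight@2.sex]
```

If the names differ in your version of Stata, the legend of coefficient names reported by the estimation command is the place to check.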

svy: mean, svy: prop, svy: ratio, and svy: total produce estimates for multiple subpopulations:

. svy: mean weight, over(sex race)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31                 Number of obs   =      10,351
Number of PSUs   = 62                 Population size = 117,157,513
                                      Design df       =          31

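-------------------------------------------------------------------
                  |             Linearized
                  |       Mean   std. err.     [95% conf. interval]
------------------+------------------------------------------------
c.weight@sex#race |
      Male#White  |   78.98862   .2125203      78.55518    79.42206
      Male#Black  |     78.324   .8476215      76.59526    80.05273
      Male#Other  |   68.16404   1.811668      64.46912    71.85896
    Female#White  |   65.10844   .2926873       64.5115    65.70538
    Female#Black  |   72.38252   1.059851      70.22094     74.5441
    Female#Other  |   59.56941   1.325068      56.86692    62.27191
-------------------------------------------------------------------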

Use estat effects to report DEFF and DEFT.

. estat effects


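---------------------------------------------------------------
                  |             Linearized
             Over |       Mean   std. err.       DEFF      DEFT
------------------+--------------------------------------------
c.weight@sex#race |
      Male#White  |   78.98862   .2125203     1.15287   1.07372
      Male#Black  |     78.324   .8476215     1.34608   1.16021
      Male#Other  |   68.16404   1.811668     2.08964   1.44556
    Female#White  |   65.10844   .2926873     2.09219   1.44644
    Female#Black  |   72.38252   1.059851     1.93387   1.39064
    Female#Other  |   59.56941   1.325068     1.55682   1.24772
---------------------------------------------------------------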

Use estat size to report the number of observations belonging to each subpopulation and estimates of the subpopulation size.

. estat size


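-------------------------------------------------------------------
                  |             Linearized
             Over |       Mean   std. err.        Obs          Size
------------------+------------------------------------------------
c.weight@sex#race |
      Male#White  |   78.98862   .2125203       4,312    49,504,800
      Male#Black  |     78.324   .8476215         500     5,096,044
      Male#Other  |   68.16404   1.811668         103     1,558,636
    Female#White  |   65.10844   .2926873       4,753    53,494,749
    Female#Black  |   72.38252   1.059851         586     6,093,192
    Female#Other  |   59.56941   1.325068          97     1,410,092
-------------------------------------------------------------------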

You can fit a wide variety of models using svy estimators (see the tables above for a list of available commands). Shown below is an example of svy: logit, which fits logistic regressions for survey data.

. webuse nhanes2d

. svy: logit highbp height weight age c.age#c.age female black
(running logit on estimation sample)

Survey: Logistic regression

Number of strata = 31                            Number of obs   =      10,351
Number of PSUs   = 62                            Population size = 117,157,513
                                                 Design df       =          31
                                                 F(6, 26)        =      231.75
                                                 Prob > F        =      0.0000

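------------------------------------------------------------------------------
             |             Linearized
      highbp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      height |  -.0345643   .0053121    -6.51   0.000    -.0453985   -.0237301
      weight |    .051004   .0025292    20.17   0.000     .0458457    .0561622
         age |   .0554544   .0127859     4.34   0.000     .0293774    .0815314
 c.age#c.age |  -.0000676   .0001385    -0.49   0.629    -.0003502    .0002149
      female |  -.4758698   .0561318    -8.48   0.000    -.5903513   -.3613882
       black |    .338201   .1075191     3.15   0.004     .1189143    .5574877
       _cons |  -.5140351   .8747001    -0.59   0.561    -2.297998    1.269928
------------------------------------------------------------------------------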

svy: logit can display estimates as coefficients or as odds ratios. Below we redisplay the previous model, requesting that the estimates be expressed as odds ratios.

. svy: logit, or

Survey: Logistic regression

Number of strata = 31                            Number of obs   =      10,351
Number of PSUs   = 62                            Population size = 117,157,513
                                                 Design df       =          31
                                                 F(6, 26)        =      231.75
                                                 Prob > F        =      0.0000

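------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds ratio   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      height |   .9660262   .0051317    -6.51   0.000     .9556166    .9765492
      weight |   1.052327   .0026615    20.17   0.000     1.046913    1.057769
         age |   1.057021    .013515     4.34   0.000     1.029813    1.084947
 c.age#c.age |   .9999324   .0001385    -0.49   0.629     .9996499    1.000215
      female |   .6213444   .0348772    -8.48   0.000     .5541326    .6967085
       black |   1.402422   .1507872     3.15   0.004     1.126273     1.74628
       _cons |   .5980774   .5231384    -0.59   0.561     .1004598    3.560595
------------------------------------------------------------------------------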
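The two displays describe the same fitted model: each odds ratio is the exponentiated coefficient, which is why the t statistics and p-values are identical in both tables. The following sketch reproduces the height row by hand. Treating the reported standard error as the delta-method value (odds ratio times the coefficient's standard error) is an assumption on our part, though it agrees with the figures shown.

```stata
* Refit the model quietly so the fragment is self-contained.
webuse nhanes2d, clear
quietly svy: logit highbp height weight age c.age#c.age female black

* Odds ratio for height: exponentiate the coefficient.
display exp(_b[height])

* Approximate standard error of the odds ratio (assumed delta method).
display exp(_b[height]) * _se[height]
```

The confidence limits in the odds-ratio table are likewise the exponentiated limits from the coefficient table.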

After running a logistic regression, you can use lincom to compute odds ratios for any covariate group relative to another.

. lincom female + black, or

 ( 1)  [highbp]female + [highbp]black = 0

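------------------------------------------------------------------------------
      highbp | Odds ratio   Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   .8713873   .1233177    -0.97   0.338     .6529215    1.162951
------------------------------------------------------------------------------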
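To see where this number comes from, note that the sum female + black is the log odds of high blood pressure for a black female relative to a nonblack male, holding the other covariates fixed (the model has no female-by-black interaction). A minimal check, refitting the same model as above:

```stata
* Refit the model quietly so the fragment is self-contained.
webuse nhanes2d, clear
quietly svy: logit highbp height weight age c.age#c.age female black

* Point estimate of the odds ratio reported by lincom, computed directly
* as the exponentiated sum of the two coefficients.
display exp(_b[female] + _b[black])
```

lincom adds what this one-line calculation cannot: a design-based standard error, test, and confidence interval for the combination.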

You can also fit regression models for a subpopulation:

. svy, subpop(black): logistic highbp age female
(running logistic on estimation sample)

Survey: Logistic regression

Number of strata = 30                            Number of obs   =      10,013
Number of PSUs   = 60                            Population size = 113,415,086
                                                 Subpop. no. obs =       1,086
                                                 Subpop. size    =  11,189,236
                                                 Design df       =          30
                                                 F(2, 29)        =       83.52
                                                 Prob > F        =      0.0000

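------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds ratio   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   1.060226   .0047619    13.02   0.000     1.050546    1.069996
      female |   .8280475   .1063299    -1.47   0.152     .6370331    1.076338
       _cons |   .0791591   .0185411   -10.83   0.000     .0490631    .1277163
------------------------------------------------------------------------------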
Note: 1 stratum omitted because it contains no subpopulation members.

Survey data require some special data management. svydescribe can be used to examine the design structure of the dataset. It can also be used to see the number of missing and nonmissing observations per stratum (or optionally per stage) for one or more variables.

. svydescribe hdresult

Survey: Describing stage 1 sampling units

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: 

                              Number of obs with
             Number of units  complete   missing       # obs per included unit
 Stratum  included   omitted      data      data       Min      Mean       Max
------------------------------------------------------------------------------
       1         1*        1       114       266       114     114.0       114
       2         1*        1        98        87        98      98.0        98
       3         2         0       277        71       116     138.5       161
       4         2         0       340       120       160     170.0       180
       5         2         0       173        79        81      86.5        92
       6         2         0       255        43       116     127.5       139
       7         2         0       409        67       191     204.5       218
       8         2         0       299        39       129     149.5       170
       9         2         0       218        26        85     109.0       133
      10         2         0       233        29       103     116.5       130
      11         2         0       238        37        97     119.0       141
      12         2         0       275        39       121     137.5       154
      13         2         0       297        45       123     148.5       174
      14         2         0       355        50       167     177.5       188
      15         2         0       329        51       151     164.5       178
      16         2         0       280        56       134     140.0       146
      17         2         0       352        41       155     176.0       197
      18         2         0       335        24       135     167.5       200
      20         2         0       240        45        95     120.0       145
      21         2         0       198        16        91      99.0       107
      22         2         0       263        38       116     131.5       147
      23         2         0       304        37       143     152.0       161
      24         2         0       388        50       182     194.0       206
      25         2         0       239        17       106     119.5       133
      26         2         0       240        21       119     120.0       121
      27         2         0       259        24       127     129.5       132
      28         2         0       284        15       131     142.0       153
      29         2         0       440        63       193     220.0       247
      30         2         0       326        39       147     163.0       179
      31         2         0       279        29       121     139.5       158
      32         2         0       383        67       180     191.5       203
------------------------------------------------------------------------------
      31        60         2     8,720     1,631        81     145.3       247
                                          10,351