Survey commands
Stata has a number of commands designed to handle the special requirements
of complex survey data. The commands will handle any or all of the
following survey-design features: probability sampling weights,
stratification, multiple stages of cluster sampling, and poststratification.
There are commands for estimating means, totals, ratios, and proportions,
as well as survey versions of estimators such as linear regression, logistic
regression, and probit models; see the table below for a complete
listing of svy commands.
Variance estimates are produced using one of five variance
estimation techniques: balanced repeated replication, the bootstrap,
the jackknife, successive difference replication, or Taylor linearization.
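As a minimal sketch of how a variance estimator is selected, the commands below use the nhanes2 dataset that appears later on this page. Taylor linearization is the default; the jackknife can be requested for one command with the svy jackknife prefix or made the default through the vce() option of svyset. The bootstrap and successive difference replication additionally need replicate-weight variables, which this dataset is not assumed to contain, so they are not shown.

```stata
* Load the example data and declare the design (linearization is the default)
webuse nhanes2
svyset psu [pw=finalwgt], strata(strata)
svy: mean weight

* Request jackknife variance estimates for this command only
svy jackknife: mean weight

* Or make the jackknife the default for subsequent svy commands
svyset psu [pw=finalwgt], strata(strata) vce(jackknife)
svy: mean weight
```

The point estimate is the same in each case; only the standard error and confidence interval depend on the variance estimator.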
The Stata estimation commands designed to handle the special requirements of
complex survey data work with the svy prefix:
| svy: biprobit | Bivariate probit regression for survey data |
| svy: clogit | Conditional (fixed-effects) logistic regression for survey data |
| svy: cloglog | Complementary log-log regression for survey data |
| svy: cnsreg | Constrained linear regression for survey data |
| svy: glm | Generalized linear models for survey data |
| svy: gnbreg | Generalized negative binomial regression for survey data |
| svy: heckman | Heckman selection model for survey data |
| svy: heckprob | Probit model with sample selection for survey data |
| svy: hetprob | Heteroskedastic probit regression for survey data |
| svy: intreg | Interval regression for survey data |
| svy: ivprobit | Probit model with endogenous regressors for survey data |
| svy: ivregress | Single-equation instrumental-variables regression for survey data |
| svy: ivtobit | Tobit model with endogenous regressors for survey data |
| svy: logistic | Logistic regression for survey data, reporting odds ratios |
| svy: logit | Logistic regression for survey data, reporting coefficients |
| svy: mean | Estimate means for survey data |
| svy: mlogit | Multinomial (polytomous) logistic regression for survey data |
| svy: mprobit | Multinomial probit regression for survey data |
| svy: nbreg | Negative binomial regression for survey data |
| svy: nl | Nonlinear least-squares estimation for survey data |
| svy: ologit | Ordered logistic regression for survey data |
| svy: oprobit | Ordered probit regression for survey data |
| svy: poisson | Poisson regression for survey data |
| svy: probit | Probit regression for survey data |
| svy: proportion | Estimate proportions for survey data |
| svy: ratio | Estimate ratios for survey data |
| svy: regress | Linear regression for survey data |
| svy: scobit | Skewed logistic regression for survey data |
| svy: sem | Structural equation modeling for survey data |
| svy: slogit | Stereotype logistic regression for survey data |
| svy: stcox | Cox proportional hazards model for survey data |
| svy: streg | Parametric survival models for survey data |
| svy: tnbreg | Truncated negative binomial regression for survey data |
| svy: tobit | Tobit regression for survey data |
| svy: total | Estimate totals for survey data |
| svy: tpoisson | Truncated Poisson regression for survey data |
| svy: treatreg | Treatment-effects regression for survey data |
| svy: truncreg | Truncated regression for survey data |
| svy: zinb | Zero-inflated negative binomial regression for survey data |
| svy: zip | Zero-inflated Poisson regression for survey data |
Many other estimation commands in Stata also have features that make
them suitable for certain limited survey designs. For example, Stata’s
competing-risks regression routine (stcrreg) handles sampling weights
properly when they are specified, and it also handles clustering.
Stata's xtmixed command for fitting multilevel
linear models allows for both sampling weights and clustering. Sampling
weights may be specified at all levels of a multilevel model, so weights
are necessarily treated differently in xtmixed than in other estimation
commands. Some caution is required; see section "Survey data" in
[XT] xtmixed for details. Also see the example of using
xtmixed with survey data.
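A rough sketch of what level-specific weights can look like in xtmixed follows. The variable names (y, x, school, wt_student, wt_school) are placeholders for a hypothetical two-level dataset, not a dataset shipped with Stata, and the choice of pwscale() method is an assumption that should be checked against [XT] xtmixed for the design at hand.

```stata
* Hypothetical data: students (level 1) nested within schools (level 2)
* wt_student: sampling weight for students within their school (placeholder)
* wt_school:  sampling weight for schools (placeholder)
xtmixed y x [pw=wt_student] || school:, pweight(wt_school) pwscale(size) mle
```

The observation-level weight is given in the usual [pw=] form, while the weight for the higher level is attached to that level's random-effects equation.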
estat effects computes the design effects DEFF and DEFT, as well as
misspecification effects MEFF and MEFT. The test command, used after
a svy estimation command, computes adjusted Wald tests and Bonferroni
tests for linear hypotheses (single or joint).
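For instance, a joint hypothesis can be tested after a survey logistic regression roughly as follows. This sketch uses the nhanes2d dataset and the model that appear later on this page; the hypotheses chosen are for illustration only.

```stata
webuse nhanes2d
svy: logit highbp height weight age c.age#c.age female black

* Adjusted Wald test that the female and black coefficients are jointly zero
test female black

* Same hypothesis, adding Bonferroni-adjusted tests of each coefficient
test female black, mtest(bonferroni)

* Design effects for the fitted coefficients
estat effects
```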
Here is an example of the use of the svy: mean command:
. webuse nhanes2
. svyset psu [pw=finalwgt], strata(strata)
pweight: finalwgt
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1: <zero>
. svy: mean weight
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 31 Number of obs = 10351
Number of PSUs = 62 Population size = 117157513
Design df = 31
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
weight | 71.90064 .1654434 71.56321 72.23806
--------------------------------------------------------------
The svyset command, illustrated above, allows you to set the
variables that contain the sampling weights, strata, and any PSU identifiers
at the outset. These variables are remembered for subsequent commands and
do not have to be reentered.
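A short sketch of working with the stored settings is shown below. The filename nhanes2_svyset is a placeholder chosen for this illustration.

```stata
* Redisplay the survey settings currently attached to the data
svyset

* The settings are stored with the dataset, so saving keeps them
save nhanes2_svyset, replace

* Remove the settings if the design needs to be declared differently
svyset, clear
```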
Estimating the difference between two subpopulation means can be done by
running svy: mean with an over() option to produce subpopulation
estimates and then running the command lincom:
. svy: mean weight, over(sex)
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 31 Number of obs = 10351
Number of PSUs = 62 Population size = 117157513
Design df = 31
Male: sex = Male
Female: sex = Female
--------------------------------------------------------------
| Linearized
Over | Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
weight |
Male | 78.62789 .2097761 78.20004 79.05573
Female | 65.70701 .266384 65.16372 66.25031
--------------------------------------------------------------
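The lincom step might look like the sketch below. The coefficient names follow the Male and Female labels in the output above; other Stata releases may name these coefficients differently, in which case replaying the estimates with the coeflegend option shows the names to use.

```stata
* Difference between the male and female subpopulation means of weight
lincom [weight]Male - [weight]Female

* If the names above are not recognized, list the coefficient names
svy: mean weight, over(sex) coeflegend
```

From the estimates above, the point estimate of the difference is 78.62789 - 65.70701 = 12.92088; lincom also reports a design-based standard error and confidence interval for it.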
The svy: mean, svy: proportion,
svy: ratio, and svy: total
commands can also produce estimates for subpopulations defined by more
than one variable:
. svy: mean weight, over(sex race)
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 31 Number of obs = 10351
Number of PSUs = 62 Population size = 117157513
Design df = 31
Over: sex race
_subpop_1: Male White
_subpop_2: Male Black
_subpop_3: Male Other
_subpop_4: Female White
_subpop_5: Female Black
_subpop_6: Female Other
--------------------------------------------------------------
| Linearized
Over | Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
weight |
_subpop_1 | 78.98862 .2125203 78.55518 79.42206
_subpop_2 | 78.324 .8476215 76.59526 80.05273
_subpop_3 | 68.16404 1.811668 64.46912 71.85896
_subpop_4 | 65.10844 .2926873 64.5115 65.70538
_subpop_5 | 72.38252 1.059851 70.22094 74.5441
_subpop_6 | 59.56941 1.325068 56.86692 62.27191
--------------------------------------------------------------
Use estat effects to report DEFF and DEFT.
. estat effects
Over: sex race
_subpop_1: Male White
_subpop_2: Male Black
_subpop_3: Male Other
_subpop_4: Female White
_subpop_5: Female Black
_subpop_6: Female Other
----------------------------------------------------------
| Linearized
Over | Mean Std. Err. DEFF DEFT
-------------+--------------------------------------------
weight |
_subpop_1 | 78.98862 .2125203 1.15287 1.07372
_subpop_2 | 78.324 .8476215 1.34608 1.16021
_subpop_3 | 68.16404 1.811668 2.08964 1.44556
_subpop_4 | 65.10844 .2926873 2.09219 1.44644
_subpop_5 | 72.38252 1.059851 1.93387 1.39064
_subpop_6 | 59.56941 1.325068 1.55682 1.24772
----------------------------------------------------------
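DEFF compares the design-based variance estimate with the one that would be obtained under simple random sampling, and DEFT is the corresponding comparison on the standard-error scale. Because no finite population correction is set here (FPC 1: <zero>), DEFT should equal the square root of DEFF, which can be checked by hand:

```stata
* DEFT for _subpop_1 recovered from its DEFF
display sqrt(1.15287)

* DEFT for _subpop_4 recovered from its DEFF
display sqrt(2.09219)
```

The results, about 1.0737 and 1.4464, agree with the DEFT column above.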
Use estat size to report the number of observations belonging to each
subpopulation and estimates of the subpopulation size.
. estat size
Over: sex race
_subpop_1: Male White
_subpop_2: Male Black
_subpop_3: Male Other
_subpop_4: Female White
_subpop_5: Female Black
_subpop_6: Female Other
----------------------------------------------------------------------
| Linearized
Over | Mean Std. Err. Obs Size
-------------+--------------------------------------------------------
weight |
_subpop_1 | 78.98862 .2125203 4312 49504800
_subpop_2 | 78.324 .8476215 500 5096044
_subpop_3 | 68.16404 1.811668 103 1558636
_subpop_4 | 65.10844 .2926873 4753 53494749
_subpop_5 | 72.38252 1.059851 586 6093192
_subpop_6 | 59.56941 1.325068 97 1410092
----------------------------------------------------------------------
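Since the six subpopulations partition the sample, the Obs and Size columns should add up to the totals in the header of the estimation output. A quick arithmetic check:

```stata
* Observations across the six subpopulations
display 4312 + 500 + 103 + 4753 + 586 + 97

* Estimated subpopulation sizes (formatted to avoid scientific notation)
display %12.0f 49504800 + 5096044 + 1558636 + 53494749 + 6093192 + 1410092
```

The sums are 10351 and 117157513, matching Number of obs and Population size.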
You can fit linear regressions, logistic regressions, and probit models
using svy estimators. Shown below is an example of
svy: logit, which fits logistic regressions for
survey data.
. webuse nhanes2d
. svy: logit highbp height weight age c.age#c.age female black
(running logit on estimation sample)
Survey: Logistic regression
Number of strata = 31 Number of obs = 10351
Number of PSUs = 62 Population size = 117157513
Design df = 31
F( 6, 26) = 87.70
Prob > F = 0.0000
------------------------------------------------------------------------------
| Linearized
highbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
height | -.0325996 .0058727 -5.55 0.000 -.0445771 -.0206222
weight | .049074 .0031966 15.35 0.000 .0425545 .0555936
age | .1541151 .0208709 7.38 0.000 .1115486 .1966815
|
c.age#c.age | -.0010746 .0002025 -5.31 0.000 -.0014877 -.0006616
|
female | -.356497 .0885354 -4.03 0.000 -.537066 -.1759279
black | .3429301 .1409005 2.43 0.021 .0555615 .6302986
_cons | -4.89574 1.159135 -4.22 0.000 -7.259813 -2.531668
------------------------------------------------------------------------------
svy: logit can display estimates as coefficients or
as odds ratios. Below we redisplay the previous model, requesting that the
estimates be expressed as odds ratios.
. svy: logit, or
Survey: Logistic regression
Number of strata = 31 Number of obs = 10351
Number of PSUs = 62 Population size = 117157513
Design df = 31
F( 6, 26) = 87.70
Prob > F = 0.0000
------------------------------------------------------------------------------
| Linearized
highbp | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
height | .967926 .0056843 -5.55 0.000 .9564019 .979589
weight | 1.050298 .0033574 15.35 0.000 1.043473 1.057168
age | 1.166625 .0243485 7.38 0.000 1.118008 1.217356
|
c.age#c.age | .998926 .0002023 -5.31 0.000 .9985135 .9993386
|
female | .7001246 .0619858 -4.03 0.000 .5844605 .8386784
black | 1.40907 .1985388 2.43 0.021 1.057134 1.878171
------------------------------------------------------------------------------
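The same model can also be fit with svy: logistic, which the table above lists as reporting odds ratios. A brief sketch:

```stata
* Equivalent fit that reports odds ratios by default
svy: logistic highbp height weight age c.age#c.age female black

* Redisplay the same results as coefficients
svy: logistic, coef
```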
After running a logistic regression, you can use
lincom to compute the odds ratio for one covariate
group relative to another. Below, black females are compared with
nonblack males, holding the other covariates fixed.
. lincom female + black, or
( 1) [highbp]female + [highbp]black = 0
------------------------------------------------------------------------------
highbp | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .9865247 .1631648 -0.08 0.935 .7040616 1.382309
------------------------------------------------------------------------------
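The reported odds ratio is the exponential of the sum of the two coefficients, or equivalently the product of the two individual odds ratios. This can be verified after the svy: logit fit above:

```stata
* Odds ratio from the sum of the stored coefficients
display exp(_b[female] + _b[black])

* The same quantity from the odds ratios reported earlier
display .7001246 * 1.40907
```

Both expressions give about 0.9865, in line with the lincom result.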
You can also fit linear regressions, logistic regressions, and probit models
for a subpopulation:
. svy, subpop(black): logistic highbp age female
(running logistic on estimation sample)
Survey: Logistic regression
Number of strata = 30 Number of obs = 10013
Number of PSUs = 60 Population size = 113415086
Subpop. no. of obs = 1086
Subpop. size = 11189236
Design df = 30
F( 2, 29) = 41.92
Prob > F = 0.0000
------------------------------------------------------------------------------
| Linearized
highbp | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 1.047957 .0053211 9.23 0.000 1.037146 1.058881
female | .9660029 .1419876 -0.24 0.816 .7155019 1.304206
------------------------------------------------------------------------------
Note: 1 stratum omitted because it contains no subpopulation members.
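The subpop() option keeps every observation in the variance calculation and restricts only the estimates to the subpopulation, which is why the output above reports both the full sample and the subpopulation counts. Restricting the data with an if qualifier instead would drop observations from the design and can give incorrect standard errors. A sketch of two forms of subpop():

```stata
* Subpopulation identified by an indicator variable
svy, subpop(black): logistic highbp age female

* Subpopulation identified by an if expression, here black females
svy, subpop(if black == 1 & female == 1): mean weight
```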
Survey data require some special data management. The
svydescribe command can be used to examine the design
structure of the dataset. It can also be used to see the number of missing
and nonmissing observations per stratum (or optionally per stage) for one or
more variables.
. svydescribe hdresult
Survey: Describing stage 1 sampling units
pweight: finalwgt
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1: <zero>
#Obs with #Obs with #Obs per included Unit
#Units #Units complete missing ----------------------------
Stratum included omitted data data min mean max
-------- -------- -------- -------- -------- -------- -------- --------
1 1* 1 114 266 114 114.0 114
2 1* 1 98 87 98 98.0 98
3 2 0 277 71 116 138.5 161
4 2 0 340 120 160 170.0 180
5 2 0 173 79 81 86.5 92
6 2 0 255 43 116 127.5 139
7 2 0 409 67 191 204.5 218
8 2 0 299 39 129 149.5 170
9 2 0 218 26 85 109.0 133
10 2 0 233 29 103 116.5 130
11 2 0 238 37 97 119.0 141
12 2 0 275 39 121 137.5 154
13 2 0 297 45 123 148.5 174
14 2 0 355 50 167 177.5 188
15 2 0 329 51 151 164.5 178
16 2 0 280 56 134 140.0 146
17 2 0 352 41 155 176.0 197
18 2 0 335 24 135 167.5 200
20 2 0 240 45 95 120.0 145
21 2 0 198 16 91 99.0 107
22 2 0 263 38 116 131.5 147
23 2 0 304 37 143 152.0 161
24 2 0 388 50 182 194.0 206
25 2 0 239 17 106 119.5 133
26 2 0 240 21 119 120.0 121
27 2 0 259 24 127 129.5 132
28 2 0 284 15 131 142.0 153
29 2 0 440 63 193 220.0 247
30 2 0 326 39 147 163.0 179
31 2 0 279 29 121 139.5 158
32 2 0 383 67 180 191.5 203
-------- -------- -------- -------- -------- -------- -------- --------
31 60 2 8720 1631 81 145.3 247
------------------
10351
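In the table above, strata 1 and 2 are each left with a single sampling unit once observations with missing hdresult are excluded. With the singleunit setting of missing shown in the header, standard errors are reported as missing in that situation. The sketch below shows one way to locate such strata and one possible way to handle them; which singleunit() method is appropriate depends on the survey design, so the choice of centered here is only an assumption for illustration.

```stata
* List only the strata that contain a single sampling unit
svydescribe hdresult, single

* One possible remedy: center single-unit strata at the grand mean
svyset psu [pw=finalwgt], strata(strata) singleunit(centered)
svy: mean hdresult
```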
See New in Stata 12 for more about what was added in Stata Release 12.