    # Re: st: analysis question related to svyset command

From: jpitblado@stata.com (Jeff Pitblado, StataCorp LP)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: analysis question related to svyset command
Date: Wed, 15 Nov 2006 13:40:18 -0600

```
Lawrence Hanser <lhanser@gmail.com> has a question about specifying -svyset-
for his survey dataset:

> We have completed a rather complicated (to us) survey sample. The
> sample was stratified on eight variables, for example, one variable
> was race and another was gender.  We are only interested on making
> comparisons on four variables at a time, but for different sets of
> four variables.
>
> My question is with regard to Stata's svyset command.  Should we set
> the strata using all eight variables and then proceed with our
> four-variable comparisons?  Or should we set the strata for the four
> variables we are including in the analyses at hand, and change the
> strata to reflect a different set of four variables for each set of
> analyses?

It seems that Lawrence is mixing two separate concepts: stratification and
subpopulation estimation.

***** Concept 1: Stratification

The -svyset- command provides a way for you to specify how the survey data was
sampled.  Thus if the dataset came from a simple random sample without
replacement, you would type

. svyset _n, fpc(sf)

where '_n' identifies the observations as the sampling units, and the
-sf- variable contains the fraction of the population that was sampled, also
known as the sampling fraction.  The sampling fraction is used in the finite
population correction, FPC.

For a stratified design, sampling units are independently selected within each
stratum.  For example, if we had a large classroom of students and
sampled 10% of the males and 20% of the females, then we would have a
stratified sample using gender to identify the strata.  In Stata you would
type

. svyset _n, strata(gender) fpc(gender_sf)

where the -gender- variable identifies the males and females, and the
-gender_sf- variable contains the sampling fraction for males and females in
the population corresponding with -gender- (e.g. gender_sf == .1 for males and
gender_sf == .2 for females).
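
As a rough sketch of how such a variable might be built (assuming, purely for
illustration, that -gender- is a string variable holding "male" and "female"),
you could type

. generate gender_sf = cond(gender == "male", .1, .2)

which sets -gender_sf- to .1 for males and .2 for every other observation, so
any miscoded or empty -gender- values would need to be fixed first.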

Once the survey design is implemented and the sample data is collected into a
Stata dataset, you should only have to use -svyset- once to identify the
survey design characteristics.

Sometimes the strata are identified using more than one criterion.  For
example we might stratify on gender and age group at the same time.  So if our
stratification criteria were

gender:		male, female
age group:	child, teenager, adult, senior

then the strata (in no particular order) would be

stratum 1:	male	child
stratum 2:	male	teenager
stratum 3:	male	adult
stratum 4:	male	senior
stratum 5:	female	child
stratum 6:	female	teenager
stratum 7:	female	adult
stratum 8:	female	senior

In Stata, -svyset- allows only one stratum variable per sampling stage, thus a
stratum variable would have to be generated if we only had variables for
gender and age group, separately.  Continuing with this example, we could
easily generate this stratum variable using

. egen mystrata = group(gender age_group)

where the -gender- variable identifies males and females, and the -age_group-
variable identifies the children, teenagers, adults, and seniors.  Now we
could type

. svyset _n, strata(mystrata) fpc(str_sf)

where the -str_sf- variable contains the sampling fraction corresponding to
-mystrata-.

Even though we used two criteria to identify the strata, we need a single
stratum variable in order to -svyset- the design characteristics used to
collect the data.
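
The same approach would extend to a design stratified on eight variables, as
in Lawrence's question.  As a sketch only, assuming the stratification
variables are named -s1- through -s8- and that -npop- holds the number of
population units in each stratum (all of these names are hypothetical), you
could type

. * one stratum identifier built from all eight stratification variables
. egen strata8 = group(s1 s2 s3 s4 s5 s6 s7 s8)
. * sampling fraction: sampled units in the stratum / population units
. bysort strata8: generate strata8_sf = _N / npop
. svyset _n, strata(strata8) fpc(strata8_sf)

Note that -egen- with group() leaves -strata8- missing for any observation
with a missing value in one of the eight variables, so those observations
would need attention before the design is -svyset-.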

Once you -svyset- the survey design characteristics, you can save the dataset
so that the -svyset- settings are stored with the data on disk.  Now you can use
this new dataset without having to worry about the survey design
characteristics; they have already been -svyset-.
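
For example, assuming the dataset is saved under the hypothetical name
-mysurvey-, you could type

. * store the data, including the survey settings, on disk
. save mysurvey, replace
. * later: load the data and display the stored survey settings
. use mysurvey, clear
. svyset

where -svyset- typed without arguments displays the survey design settings
currently stored with the data.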

***** Concept 2: subpopulations

Comparing subgroups of the population is pretty common in survey data
analysis.  We call these subgroups subpopulations.  For this reason the
-mean-, -proportion-, -ratio-, and -total- commands have the -over()- option
that allows you to identify multiple subpopulations "over" which to estimate
population means, proportions, ratios, and totals.

Continuing the example from above, we could compare the mean body mass index
(BMI) between males and females by typing

. svy: mean bmi, over(gender)

where the -bmi- variable contains the BMI for each sampled individual.  The
-svy- prefix takes care of all the survey design characteristics while
interacting with the -mean- command.  The result is a table of mean BMI
estimates for the male and female subpopulations.

The -over()- option allows more than one variable, so we could compare the
mean BMI between the gender and age group combinations.  In Stata you would
type

. svy: mean bmi, over(gender age_group)

The result is a table of mean BMI estimates for each observed combination of
gender and age group in the dataset.

Notice that this subpopulation analysis does not require a specific sampling
design; all you need is one or more Stata variables to identify the
subpopulation observations.
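
For instance, assuming the dataset contains a 0/1 indicator variable named
-exercise- that played no role in the stratification, a sketch of the same
kind of analysis would be

. * mean BMI for each level of the exercise indicator
. svy: mean bmi, over(exercise)
. * mean BMI for the subpopulation where exercise is nonzero
. svy, subpop(exercise): mean bmi

Neither command changes the -svyset- settings; the design stays as it was
specified.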

You could even compare regression coefficients between two or more
subpopulations by using the -subpop()- option of -svy- and the -suest-
command.  Suppose we were interested in a linear regression involving BMI and
some demographic variables: exercise indicator and age in years.  We could
fit this model for the entire population by typing

. svy: regress bmi exercise age

To compare the coefficients on -exercise- and -age- between males and females
we would have to store the estimation results for each subpopulation, then use
-suest- to combine the results for comparison.  In Stata this is done by typing

. * regression for the subpopulation of males
. svy, subpop(if gender=="male") : regress bmi exercise age
. estimates store male

. * regression for the subpopulation of females
. svy, subpop(if gender=="female") : regress bmi exercise age
. estimates store female

. * combine the estimation results from the two subpopulations
. suest male female

(Here we assume -gender- is a string variable using "male" and "female" to
identify males and females.)

The result is a table of regression parameter estimates for each gender
subpopulation.  We could then test for equality of the regression coefficients
by typing

. test [male_mean=female_mean]

where 'male_mean' and 'female_mean' are names created by the -suest- command
that identify the regression parameters for each subpopulation.
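
A single coefficient can be compared in the same way.  As a sketch that relies
on the equation names shown above, you could type

. * test equality of the -exercise- coefficient only
. test [male_mean]exercise = [female_mean]exercise
. * estimate the male-female difference in the -exercise- coefficient
. lincom [male_mean]exercise - [female_mean]exercise

It is worth checking the -suest- output for the exact equation names before
typing these.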

--Jeff