Steven Samuels <[email protected]> believes there is something
wrong with
how Stata's -svy- commands handle certainty sampling units, i.e.
sampling
units chosen with 100% certainty.
Our short reply is that there is nothing wrong here. We believe
Steven is
misinterpreting the documentation he is quoting. Stata allows us
to -svyset-
our survey data as we see fit, the only things the -svy- commands
disallow are
things that violate the definitions of the survey design variables.
First we'll qualify part of Steven's quote from the [SVY] manual.
Then we'll
address Steven's concern using an hypothetical example dataset
similar in
design to the one he mentions.
The Stata V 10 "Survey Data" manual states (page 54):
Stata's -svy- commands identify strata with an FPC equal to one as
units sampled with certainty. To properly determine the design
degrees of freedom, certainty units should be contained in their own
strata, one for each certainty unit in each sampling stage. Although
the observations contained in certainty units have a role in
parameter estimation, they contribute nothing to the variance.
If this is Stata's approach, then it is wrong.
To qualify the last sentence, we probably should have written:
... Although the observations contained in certainty units have a
role
in parameter estimation, they contribute nothing to the variance at
the stage where the FPC is 1.
Here FPC is the variable supplied to the -fpc()- option of -
svyset-. When the
FPC variable is 1, this means a sampling rate of 100%. When a 100%
sampling
rate is plugged into the variance formula, it results in a zero
multiplier,
thus the corresponding observations contribute nothing the variance
for that
stage; however, these observations could and usually do contribute
to the
variance in subsequent stages.
The only substantive topic left to discuss is the calculation of
degrees of
freedom. The rest of Steven's email describe an example survey
dataset with 3
certainty PSUs.
Let me give a simple example of a survey to study prevalence of a
disease. The target area was divided into 27 districts. Three were
known to have high prevalence and all were taken into the study. The
other districts divided into 'medium' and 'low' incidence strata. In
these two strata, samples of districts were drawn. The structure of
the design is this:
Strata Sampling Unit-Stage 1
1. District 1 village
2. District 2 village
3. District 3 village
4. Medium Incidence district
5. Low Incidence district
(Subsequent stages below the village level are: household, person.)
In other words, each of the certainty districts is not contained in
its own stratum; it is a stratum. Variation between the the
primary sampling units in these strata, as in the others, does indeed
contribute information to the variance. This is the standard
treatment for for self-representing" units, the traditional name for
certainty units. See Graham Kalton, Introduction to Survey Sampling,
Sage Publications 1983, page 85.
In this example, Steven is treating districts 1 to 3 as strata, but
the rest
of the districts are primary sampling units (PSUs) which have been
partitioned
into two strata--low and medium incidence.
Suppose we have a Stata dataset with the following variables:
pw -- the sampling weights
incidence -- 1 is "low", 2 is "medium", 3 is "high"
district -- numbered from 1 to the number of sampled districts;
where districts 1, 2, and 3 are assumed to be in
the "high" incidence stratum; our certainty units
village -- a generic village identifier, contiguous numbers
starting from 1
household -- a generic household identifier, contiguous numbers
starting from 1
With these variables also assume we have the sampling rates
corresponding to
each observation for each of the sampling stages.
fpc1 -- sampling rate of the districts within stratum, thus
this variable is 1 for all observations in which
incidence is "high"
fpc2 -- sampling rate of the villages within district
To simplify things, we'll assume a small sampling rate for the
households or
that the data supplier didn't publish it with in dataset. This
also makes the
sampling rate of persons within household irrelevant for variance
estimation.
One might think the correct way to -svyset- this data is:
svyset district [pw=pw], fpc(fpc1) strata(incidence) ///
|| village , fpc(fpc2) ///
|| household
This seems reasonable, be we believe that the model degrees of
freedom from
this setup are slightly inflated. Suppose in addition to districts
1, 2, and
3, that 16 other districts were sampled. By this setup the degrees
of freedom
would be 16 = 19-3.
A more conservative view is that the certainty units in the first
stage should
not be counted among the design degrees of freedom. We believe
this is
strictly true for a one-stage clustered design, where 100% of the
individuals
contained with the sampled clusters are observed. For multi-stage
design,
we acknowledge that this view might be considered too conservative;
however,
Steven's quote from [SVY] is directly inspired by this conservative
view.
We can cause Stata's -svy- commands to use this conservative method
for
degrees of freedom by reassigning a unique stratum id to each of
the certainty
units. For example, since all our stratum and sampling unit
identifiers are
positive integers, we can generate a pseudo-stratum identifier
variable using
negated district values to uniquely identify the certainty units.
gen p_str = cond(incidence==3, -district, incidence)
We can then use the new p_str variable in the -strata()- option for
the first
stage:
svyset district [pw=pw], fpc(fpc1) strata(incidence) ///
|| village , fpc(fpc2) ///
|| household
This will result in 14=19-5 degrees of freedom.
A third approach is to generate pseudo-sampling unit variables that
directly
reflect Steven's above setup. We will use the same negated values
trick to
generate the pseudo-unit variables for each stage:
gen p_su1 = cond(incidence == 3, -village, district)
gen p_su2 = cond(incidence == 3, -household, village)
gen p_su3 = cond(incidence == 3, -_n, household)
gen p_fpc1 = cond(incidence == 3, fpc2, fpc1)
gen p_fpc2 = cond(incidence == 3, 0, fpc2)
gen p_fpc3 = cond(incidence == 3, 1, 0)
Note that we had to generate pseudo-FPC variables to accommodate
the fact that
we are treating districts 1, 2, and 3 as pseudo-strata, the
villages within
them as pseudo-units is stage 1, the households within these
village as
pseudo-units in stage 2, and the individuals within these
households as
pseudo-units in stage 3.
svyset p_su1 [pw=pw] , fpc(p_fpc1) strata(p_str) ///
|| p_su2 , fpc(p_fpc2) ///
|| p_su3 , fpc(p_fpc3)
Now suppose that of the sampled villages, 60 came from the
certainty units
(districts 1, 2, and 3). Thus there are now 76=60+16 PSUs and 5
strata,
yielding 71=76-5 degrees of freedom. Note that 60 might be an
exceedingly
high number for sampled villages for the design that Steven
references; we're
using 60 to emphasize that the typical number of sampling units differ
drastically between stages thus the resulting design degrees of
freedom can
vary drastically depending on how you -svyset- the data.
Now we must recognize that we set this problem up so that it was
easier for us
to identify the districts, villages, and households in the dataset. A
professional data distributer might not publish the information,
especially if
it is public data; preferring the pseudo-strata and pseudo-units
variables in
this case. As survey data analysts, we should always follow the
guidelines
that are published with the data we are analyzing; these guidelines
usually tell us which variables identify the survey design
characteristics
that are subsequently used to -svyset- the data.
Note that the above 3 -svyset-s will yield identical point and
variance
(standard error) estimates. See the do-file that follows my
signature for an
example that shows this to be true.
In closing, we refer to chapter 5 (especially section 5.2 and 5.3)
of Korn and
Graubard (1999) which describes several alternative methods for
computing
degrees of freedom for survey dataset with certainty units and/or
strata with
one sampling unit. Wolter (2007) also contains examples of survey
datasets
that contain certainty units.
References
----------
Korn, E. L. and B. I. Graubard. 1999. Analysis of Health Surveys.
New York: Wiley.
Woleter, K. 2007. Introduction to variance estimation.
New York: Springer.
--Jeff
[email protected]
***** BEGIN: certainty.do
set seed 1234
set sortseed 1234
// incidence = stage 1 strata
// district = stage 1 units
local low = .25 + .5*uniform()
local low = 12
local med = 12
local nlow = ceil(`low'*(.3 + .9*uniform()))
local nmed = ceil(`med'*(.3 + .9*uniform()))
local nobs = 3 + `nlow' + `nmed'
set obs `nobs'
gen district = _n
gen incidence = cond(_n<4, 3, cond(_n<3+`nlow', 1, 2))
label define incidence 1 "low" 2 "medium" 3 "high"
label values incidence incidence
gen fpc1 = 1
replace fpc1 = `nlow'/`low' if incidence == 1
replace fpc1 = `nmed'/`med' if incidence == 2
tab district incidence
sum fpc?
// village = stage 2 units
gen nobs = 5 + ceil(20*uniform())
expand nobs
local sort incidence district
sort `sort'
by `sort' : gen fpc2 = .1 + round(.5*uniform(), .01) if _n == 1
by `sort' : replace fpc2 = fpc2[1]
gen village = _n
// household = stage 3 units
replace nobs = 10 + ceil(10*uniform())
expand nobs
local sort incidence district village
sort `sort'
gen household = _n
// person = stage 4 units
replace nobs = ceil(6*uniform())
expand nobs
gen pw = 1/(fpc1*fpc2*.05*.05)
gen y = incidence*invnormal(uniform()) + district + village
drop nobs
tab district incidence
sum fpc?
svyset district [pw=pw] , fpc(fpc1) strata(incidence) ///
|| village , fpc(fpc2) ///
|| household
svy: mean y
est store svyset1
gen p_str = cond(incidence == 3, -district, incidence)
svyset district [pw=pw] , fpc(fpc1) strata(p_str) ///
|| village , fpc(fpc2) ///
|| household
svy: mean y
est store svyset2
gen p_su1 = cond(incidence == 3, -village, district)
gen p_su2 = cond(incidence == 3, -household, village)
gen p_su3 = cond(incidence == 3, -_n, household)
gen p_fpc1 = cond(incidence == 3, fpc2, fpc1)
gen p_fpc2 = cond(incidence == 3, 0, fpc2)
gen p_fpc3 = cond(incidence == 3, 1, 0)
svyset p_su1 [pw=pw] , fpc(p_fpc1) strata(p_str) ///
|| p_su2 , fpc(p_fpc2) ///
|| p_su3 , fpc(p_fpc3)
svy: mean y
est store svyset3
// NOTE: The point and SE estimates are the same, the only real
difference is
// in the design degrees of freedom -e(df_r)-.
est table svyset1 svyset2 svyset3, se stats(df_r)
***** END: certainty.do
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/