# RE: st: Certainty units in Stata's survey commands

 From [email protected] (Jeff Pitblado, StataCorp LP) To [email protected] Subject RE: st: Certainty units in Stata's survey commands Date Wed, 09 Jan 2008 00:19:02 -0600

Steven Samuels <[email protected]> believes there is something wrong with
how Stata's -svy- commands handle certainty sampling units, i.e. sampling
units chosen with 100% certainty.

Our short reply is that there is nothing wrong here.  We believe Steven is
misinterpreting the documentation he is quoting.  Stata allows us to -svyset-
our survey data as we see fit, the only things the -svy- commands disallow are
things that violate the definitions of the survey design variables.

First we'll qualify part of Steven's quote from the [SVY] manual.  Then we'll
address Steven's concern using an hypothetical example dataset similar in
design to the one he mentions.

> The Stata V 10 "Survey Data" manual states (page 54):

> Stata's -svy- commands identify strata with an FPC equal to one as
> units sampled with certainty. To properly determine the design
> degrees of freedom, certainty units should be contained in their own
> strata, one for each certainty unit in each sampling stage. Although
> the observations contained in certainty units have a role in
> parameter estimation, they contribute nothing to the variance.

> If this is Stata's approach, then it is wrong.

To qualify the last sentence, we probably should have written:

... Although the observations contained in certainty units have a role
in parameter estimation, they contribute nothing to the variance at
the stage where the FPC is 1.

Here FPC is the variable supplied to the -fpc()- option of -svyset-.  When the
FPC variable is 1, this means a sampling rate of 100%.  When a 100% sampling
rate is plugged into the variance formula, it results in a zero multiplier,
thus the corresponding observations contribute nothing the variance for that
stage; however, these observations could and usually do contribute to the
variance in subsequent stages.

The only substantive topic left to discuss is the calculation of degrees of
freedom.  The rest of Steven's email describe an example survey dataset with 3
certainty PSUs.

> Let me give a simple example of a survey to study prevalence of a
> disease.  The target area was divided into 27 districts.  Three were
> known to have high prevalence and all were taken into the study. The
> other districts divided into 'medium' and 'low' incidence strata.  In
> these two strata,  samples of districts were drawn.  The structure of
> the design is this:

> Strata                Sampling Unit-Stage 1
> 1. District 1         village
> 2. District 2         village
> 3. District 3         village
> 4. Medium Incidence   district
> 5. Low Incidence      district

> (Subsequent stages below the village level are: household, person.)

> In other words, each of the certainty districts is not contained in
> its own stratum; it is a stratum.  Variation between the the
> primary sampling units in these strata, as in the others, does indeed
> contribute information to the variance.  This is the standard
> treatment for for self-representing" units, the traditional name for
> certainty units.  See Graham Kalton, Introduction to Survey Sampling,
> Sage Publications 1983, page 85.

In this example, Steven is treating districts 1 to 3 as strata, but the rest
of the districts are primary sampling units (PSUs) which have been partitioned
into two strata--low and medium incidence.

Suppose we have a Stata dataset with the following variables:

pw		-- the sampling weights
incidence	-- 1 is "low", 2 is "medium", 3 is "high"
district	-- numbered from 1 to the number of sampled districts;
where districts 1, 2, and 3 are assumed to be in
the "high" incidence stratum; our certainty units
village		-- a generic village identifier, contiguous numbers
starting from 1
household	-- a generic household identifier, contiguous numbers
starting from 1

With these variables also assume we have the sampling rates corresponding to
each observation for each of the sampling stages.

fpc1		-- sampling rate of the districts within stratum, thus
this variable is 1 for all observations in which
incidence is "high"
fpc2		-- sampling rate of the villages within district

To simplify things, we'll assume a small sampling rate for the households or
that the data supplier didn't publish it with in dataset.  This also makes the
sampling rate of persons within household irrelevant for variance estimation.

One might think the correct way to -svyset- this data is:

svyset     district [pw=pw], fpc(fpc1) strata(incidence)   ///
|| village         , fpc(fpc2)                     ///
|| household

This seems reasonable, be we believe that the model degrees of freedom from
this setup are slightly inflated.  Suppose in addition to districts 1, 2, and
3, that 16 other districts were sampled.  By this setup the degrees of freedom
would be 16 = 19-3.

A more conservative view is that the certainty units in the first stage should
not be counted among the design degrees of freedom.  We believe this is
strictly true for a one-stage clustered design, where 100% of the individuals
contained with the sampled clusters are observed.  For multi-stage design,
we acknowledge that this view might be considered too conservative; however,
Steven's quote from [SVY] is directly inspired by this conservative view.

We can cause Stata's -svy- commands to use this conservative method for
degrees of freedom by reassigning a unique stratum id to each of the certainty
units.  For example, since all our stratum and sampling unit identifiers are
positive integers, we can generate a pseudo-stratum identifier variable using
negated district values to uniquely identify the certainty units.

gen p_str = cond(incidence==3, -district, incidence)

We can then use the new p_str variable in the -strata()- option for the first
stage:

svyset     district [pw=pw], fpc(fpc1) strata(incidence)   ///
|| village         , fpc(fpc2)                     ///
|| household

This will result in 14=19-5 degrees of freedom.

A third approach is to generate pseudo-sampling unit variables that directly
reflect Steven's above setup.  We will use the same negated values trick to
generate the pseudo-unit variables for each stage:

gen p_su1  = cond(incidence == 3, -village,   district)
gen p_su2  = cond(incidence == 3, -household, village)
gen p_su3  = cond(incidence == 3, -_n,        household)

gen p_fpc1 = cond(incidence == 3, fpc2,       fpc1)
gen p_fpc2 = cond(incidence == 3, 0,          fpc2)
gen p_fpc3 = cond(incidence == 3, 1,          0)

Note that we had to generate pseudo-FPC variables to accommodate the fact that
we are treating districts 1, 2, and 3 as pseudo-strata, the villages within
them as pseudo-units is stage 1, the households within these village as
pseudo-units in stage 2, and the individuals within these households as
pseudo-units in stage 3.

svyset     p_su1 [pw=pw]   , fpc(p_fpc1) strata(p_str)     ///
|| p_su2           , fpc(p_fpc2)                   ///
|| p_su3           , fpc(p_fpc3)

Now suppose that of the sampled villages, 60 came from the certainty units
(districts 1, 2, and 3).  Thus there are now 76=60+16 PSUs and 5 strata,
yielding 71=76-5 degrees of freedom.  Note that 60 might be an exceedingly
high number for sampled villages for the design that Steven references; we're
using 60 to emphasize that the typical number of sampling units differ
drastically between stages thus the resulting design degrees of freedom can
vary drastically depending on how you -svyset- the data.

Now we must recognize that we set this problem up so that it was easier for us
to identify the districts, villages, and households in the dataset.  A
professional data distributer might not publish the information, especially if
it is public data; preferring the pseudo-strata and pseudo-units variables in
this case.  As survey data analysts, we should always follow the guidelines
that are published with the data we are analyzing; these guidelines
usually tell us which variables identify the survey design characteristics
that are subsequently used to -svyset- the data.

Note that the above 3 -svyset-s will yield identical point and variance
(standard error) estimates.  See the do-file that follows my signature for an
example that shows this to be true.

In closing, we refer to chapter 5 (especially section 5.2 and 5.3) of Korn and
Graubard (1999) which describes several alternative methods for computing
degrees of freedom for survey dataset with certainty units and/or strata with
one sampling unit.  Wolter (2007) also contains examples of survey datasets
that contain certainty units.

References
----------

Korn, E. L. and B. I. Graubard. 1999. Analysis of Health Surveys.
New York: Wiley.

Woleter, K. 2007. Introduction to variance estimation.
New York: Springer.

--Jeff
[email protected]

***** BEGIN: certainty.do
set seed 1234
set sortseed 1234

// incidence = stage 1 strata
// district = stage 1 units
local low = .25 + .5*uniform()
local low = 12
local med = 12
local nlow	= ceil(`low'*(.3 + .9*uniform()))
local nmed	= ceil(`med'*(.3 + .9*uniform()))
local nobs =	3 + `nlow' + `nmed'
set obs `nobs'
gen district = _n
gen incidence = cond(_n<4, 3, cond(_n<3+`nlow', 1, 2))
label define incidence 1 "low" 2 "medium" 3 "high"
label values incidence incidence

gen fpc1 = 1
replace fpc1 = `nlow'/`low' if incidence == 1
replace fpc1 = `nmed'/`med' if incidence == 2

tab district	incidence
sum fpc?

// village = stage 2 units
gen nobs = 5 + ceil(20*uniform())
expand nobs
local sort incidence district
sort `sort'
by `sort' : gen fpc2 = .1 + round(.5*uniform(), .01) if _n == 1
by `sort' : replace fpc2 = fpc2[1]
gen village = _n

// household = stage 3 units
replace nobs = 10 + ceil(10*uniform())
expand nobs
local sort incidence district village
sort `sort'
gen household = _n

// person = stage 4 units
replace nobs = ceil(6*uniform())
expand nobs

gen pw = 1/(fpc1*fpc2*.05*.05)

gen y = incidence*invnormal(uniform()) + district + village

drop nobs

tab district	incidence
sum fpc?

svyset		district [pw=pw]	, fpc(fpc1) strata(incidence)	///
||	village			, fpc(fpc2)			///
||	household

svy: mean y
est store svyset1

gen p_str	= cond(incidence == 3,	-district,	incidence)

svyset		district [pw=pw]	, fpc(fpc1) strata(p_str)	///
||	village			, fpc(fpc2)			///
||	household

svy: mean y
est store svyset2

gen p_su1	= cond(incidence == 3,	-village,	district)
gen p_su2	= cond(incidence == 3,	-household,	village)
gen p_su3	= cond(incidence == 3,	-_n,		household)
gen p_fpc1	= cond(incidence == 3,	fpc2,		fpc1)
gen p_fpc2	= cond(incidence == 3,	0,		fpc2)
gen p_fpc3	= cond(incidence == 3,	1,		0)

svyset		p_su1 [pw=pw]	, fpc(p_fpc1) strata(p_str)	///
||	p_su2		, fpc(p_fpc2)			///
||	p_su3		, fpc(p_fpc3)

svy: mean y
est store svyset3

// NOTE: The point and SE estimates are the same, the only real difference is
// in the design degrees of freedom -e(df_r)-.

est table svyset1 svyset2 svyset3, se stats(df_r)
***** END: certainty.do

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/