[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
[email protected] (Jeff Pitblado, StataCorp LP) |

To |
[email protected] |

Subject |
RE: st: Certainty units in Stata's survey commands |

Date |
Wed, 09 Jan 2008 00:19:02 -0600 |

Steven Samuels <[email protected]> believes there is something wrong with how Stata's -svy- commands handle certainty sampling units, i.e. sampling units chosen with 100% certainty. Our short reply is that there is nothing wrong here. We believe Steven is misinterpreting the documentation he is quoting. Stata allows us to -svyset- our survey data as we see fit, the only things the -svy- commands disallow are things that violate the definitions of the survey design variables. First we'll qualify part of Steven's quote from the [SVY] manual. Then we'll address Steven's concern using an hypothetical example dataset similar in design to the one he mentions. > The Stata V 10 "Survey Data" manual states (page 54): > Stata's -svy- commands identify strata with an FPC equal to one as > units sampled with certainty. To properly determine the design > degrees of freedom, certainty units should be contained in their own > strata, one for each certainty unit in each sampling stage. Although > the observations contained in certainty units have a role in > parameter estimation, they contribute nothing to the variance. > If this is Stata's approach, then it is wrong. To qualify the last sentence, we probably should have written: ... Although the observations contained in certainty units have a role in parameter estimation, they contribute nothing to the variance at the stage where the FPC is 1. Here FPC is the variable supplied to the -fpc()- option of -svyset-. When the FPC variable is 1, this means a sampling rate of 100%. When a 100% sampling rate is plugged into the variance formula, it results in a zero multiplier, thus the corresponding observations contribute nothing the variance for that stage; however, these observations could and usually do contribute to the variance in subsequent stages. The only substantive topic left to discuss is the calculation of degrees of freedom. The rest of Steven's email describe an example survey dataset with 3 certainty PSUs. > Let me give a simple example of a survey to study prevalence of a > disease. The target area was divided into 27 districts. Three were > known to have high prevalence and all were taken into the study. The > other districts divided into 'medium' and 'low' incidence strata. In > these two strata, samples of districts were drawn. The structure of > the design is this: > Strata Sampling Unit-Stage 1 > 1. District 1 village > 2. District 2 village > 3. District 3 village > 4. Medium Incidence district > 5. Low Incidence district > (Subsequent stages below the village level are: household, person.) > In other words, each of the certainty districts is not contained in > its own stratum; it is a stratum. Variation between the the > primary sampling units in these strata, as in the others, does indeed > contribute information to the variance. This is the standard > treatment for for self-representing" units, the traditional name for > certainty units. See Graham Kalton, Introduction to Survey Sampling, > Sage Publications 1983, page 85. In this example, Steven is treating districts 1 to 3 as strata, but the rest of the districts are primary sampling units (PSUs) which have been partitioned into two strata--low and medium incidence. Suppose we have a Stata dataset with the following variables: pw -- the sampling weights incidence -- 1 is "low", 2 is "medium", 3 is "high" district -- numbered from 1 to the number of sampled districts; where districts 1, 2, and 3 are assumed to be in the "high" incidence stratum; our certainty units village -- a generic village identifier, contiguous numbers starting from 1 household -- a generic household identifier, contiguous numbers starting from 1 With these variables also assume we have the sampling rates corresponding to each observation for each of the sampling stages. fpc1 -- sampling rate of the districts within stratum, thus this variable is 1 for all observations in which incidence is "high" fpc2 -- sampling rate of the villages within district To simplify things, we'll assume a small sampling rate for the households or that the data supplier didn't publish it with in dataset. This also makes the sampling rate of persons within household irrelevant for variance estimation. One might think the correct way to -svyset- this data is: svyset district [pw=pw], fpc(fpc1) strata(incidence) /// || village , fpc(fpc2) /// || household This seems reasonable, be we believe that the model degrees of freedom from this setup are slightly inflated. Suppose in addition to districts 1, 2, and 3, that 16 other districts were sampled. By this setup the degrees of freedom would be 16 = 19-3. A more conservative view is that the certainty units in the first stage should not be counted among the design degrees of freedom. We believe this is strictly true for a one-stage clustered design, where 100% of the individuals contained with the sampled clusters are observed. For multi-stage design, we acknowledge that this view might be considered too conservative; however, Steven's quote from [SVY] is directly inspired by this conservative view. We can cause Stata's -svy- commands to use this conservative method for degrees of freedom by reassigning a unique stratum id to each of the certainty units. For example, since all our stratum and sampling unit identifiers are positive integers, we can generate a pseudo-stratum identifier variable using negated district values to uniquely identify the certainty units. gen p_str = cond(incidence==3, -district, incidence) We can then use the new p_str variable in the -strata()- option for the first stage: svyset district [pw=pw], fpc(fpc1) strata(incidence) /// || village , fpc(fpc2) /// || household This will result in 14=19-5 degrees of freedom. A third approach is to generate pseudo-sampling unit variables that directly reflect Steven's above setup. We will use the same negated values trick to generate the pseudo-unit variables for each stage: gen p_su1 = cond(incidence == 3, -village, district) gen p_su2 = cond(incidence == 3, -household, village) gen p_su3 = cond(incidence == 3, -_n, household) gen p_fpc1 = cond(incidence == 3, fpc2, fpc1) gen p_fpc2 = cond(incidence == 3, 0, fpc2) gen p_fpc3 = cond(incidence == 3, 1, 0) Note that we had to generate pseudo-FPC variables to accommodate the fact that we are treating districts 1, 2, and 3 as pseudo-strata, the villages within them as pseudo-units is stage 1, the households within these village as pseudo-units in stage 2, and the individuals within these households as pseudo-units in stage 3. svyset p_su1 [pw=pw] , fpc(p_fpc1) strata(p_str) /// || p_su2 , fpc(p_fpc2) /// || p_su3 , fpc(p_fpc3) Now suppose that of the sampled villages, 60 came from the certainty units (districts 1, 2, and 3). Thus there are now 76=60+16 PSUs and 5 strata, yielding 71=76-5 degrees of freedom. Note that 60 might be an exceedingly high number for sampled villages for the design that Steven references; we're using 60 to emphasize that the typical number of sampling units differ drastically between stages thus the resulting design degrees of freedom can vary drastically depending on how you -svyset- the data. Now we must recognize that we set this problem up so that it was easier for us to identify the districts, villages, and households in the dataset. A professional data distributer might not publish the information, especially if it is public data; preferring the pseudo-strata and pseudo-units variables in this case. As survey data analysts, we should always follow the guidelines that are published with the data we are analyzing; these guidelines usually tell us which variables identify the survey design characteristics that are subsequently used to -svyset- the data. Note that the above 3 -svyset-s will yield identical point and variance (standard error) estimates. See the do-file that follows my signature for an example that shows this to be true. In closing, we refer to chapter 5 (especially section 5.2 and 5.3) of Korn and Graubard (1999) which describes several alternative methods for computing degrees of freedom for survey dataset with certainty units and/or strata with one sampling unit. Wolter (2007) also contains examples of survey datasets that contain certainty units. References ---------- Korn, E. L. and B. I. Graubard. 1999. Analysis of Health Surveys. New York: Wiley. Woleter, K. 2007. Introduction to variance estimation. New York: Springer. --Jeff [email protected] ***** BEGIN: certainty.do set seed 1234 set sortseed 1234 // incidence = stage 1 strata // district = stage 1 units local low = .25 + .5*uniform() local low = 12 local med = 12 local nlow = ceil(`low'*(.3 + .9*uniform())) local nmed = ceil(`med'*(.3 + .9*uniform())) local nobs = 3 + `nlow' + `nmed' set obs `nobs' gen district = _n gen incidence = cond(_n<4, 3, cond(_n<3+`nlow', 1, 2)) label define incidence 1 "low" 2 "medium" 3 "high" label values incidence incidence gen fpc1 = 1 replace fpc1 = `nlow'/`low' if incidence == 1 replace fpc1 = `nmed'/`med' if incidence == 2 tab district incidence sum fpc? // village = stage 2 units gen nobs = 5 + ceil(20*uniform()) expand nobs local sort incidence district sort `sort' by `sort' : gen fpc2 = .1 + round(.5*uniform(), .01) if _n == 1 by `sort' : replace fpc2 = fpc2[1] gen village = _n // household = stage 3 units replace nobs = 10 + ceil(10*uniform()) expand nobs local sort incidence district village sort `sort' gen household = _n // person = stage 4 units replace nobs = ceil(6*uniform()) expand nobs gen pw = 1/(fpc1*fpc2*.05*.05) gen y = incidence*invnormal(uniform()) + district + village drop nobs tab district incidence sum fpc? svyset district [pw=pw] , fpc(fpc1) strata(incidence) /// || village , fpc(fpc2) /// || household svy: mean y est store svyset1 gen p_str = cond(incidence == 3, -district, incidence) svyset district [pw=pw] , fpc(fpc1) strata(p_str) /// || village , fpc(fpc2) /// || household svy: mean y est store svyset2 gen p_su1 = cond(incidence == 3, -village, district) gen p_su2 = cond(incidence == 3, -household, village) gen p_su3 = cond(incidence == 3, -_n, household) gen p_fpc1 = cond(incidence == 3, fpc2, fpc1) gen p_fpc2 = cond(incidence == 3, 0, fpc2) gen p_fpc3 = cond(incidence == 3, 1, 0) svyset p_su1 [pw=pw] , fpc(p_fpc1) strata(p_str) /// || p_su2 , fpc(p_fpc2) /// || p_su3 , fpc(p_fpc3) svy: mean y est store svyset3 // NOTE: The point and SE estimates are the same, the only real difference is // in the design degrees of freedom -e(df_r)-. est table svyset1 svyset2 svyset3, se stats(df_r) ***** END: certainty.do * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Certainty units in Stata's survey commands***From:*Steven Samuels <[email protected]>

- Prev by Date:
**Re: st: Control for the dependency of observations when correlate & ANOVA** - Next by Date:
**Re: st: Calling STATA from external programs in a Windows Environment?** - Previous by thread:
**st: Control for the dependency of observations when correlate & ANOVA** - Next by thread:
**Re: st: Certainty units in Stata's survey commands** - Index(es):

© Copyright 1996–2024 StataCorp LLC | Terms of use | Privacy | Contact us | What's new | Site index |