st: svyset for two wave survey, oversampling, clusters

From: "John Reynolds"
Subject: st: svyset for two wave survey, oversampling, clusters
Date: Wed, 29 Sep 2004 22:32:54 -0400

Dear Statalisters,

I am trying to determine the appropriate use of "svyset" for a somewhat complex sampling design.

Here is a simplified description of the data set and sampling design. Think of it as a two-wave survey of youth. In the first wave, boys were oversampled because they were the primary focus of the research. In the second wave, a subset of the wave 1 boys and all of the wave 1 girls were contacted for reinterview, and an additional sample of girls was drawn from the original sampling frame to balance the sample by gender. An additional consideration is that the boys and girls come from different subsets of schools in one county. Here is the breakdown by gender for each wave.

Wave 1 contains 6,760 boys and 626 girls
wave 1 boys: 9,763 randomly selected from all 48 public middle schools in the county; 6,760 participated
wave 1 girls: 669 randomly selected from six schools selected to be representative of all 48 public middle schools; 626 participated

Wave 2 contains 956 boys and 927 girls
wave 2 boys: 1,273 of the 6,760 original wave 1 boys were randomly selected for reinterview; 956 participated in wave 2
wave 2 girls, main group: all of the 626 original wave 1 girls were selected for reinterview; 410 participated in wave 2
wave 2 girls, supplement: 888 girls were randomly selected from the rosters of the remaining 42 public middle schools (the schools that provided the wave 1 sample of girls were excluded); 517 participated in wave 2
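
For reference, the participation rates implied by the counts above can be computed directly. This is only a small arithmetic sketch; the scalar names are made up for illustration and are not variables in the data set.

```stata
* Participation rate = number participating / number selected,
* using only the counts listed in the breakdown above
scalar rr_w1_boys  = 6760/9763
scalar rr_w1_girls = 626/669
scalar rr_w2_boys  = 956/1273
scalar rr_w2_girls = 410/626
scalar rr_w2_supp  = 517/888

display "wave 1 boys:             " %5.3f scalar(rr_w1_boys)
display "wave 1 girls:            " %5.3f scalar(rr_w1_girls)
display "wave 2 boys:             " %5.3f scalar(rr_w2_boys)
display "wave 2 girls, main:      " %5.3f scalar(rr_w2_girls)
display "wave 2 girls, supplement:" %5.3f scalar(rr_w2_supp)
```

These rates matter only once nonresponse adjustments are added; the weights below adjust for the sampling design alone.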

Given these specifications, I am trying to determine what svyset should look like for three different kinds of analyses: wave 1 only, wave 2 only, and a two-panel study using covariates in both wave 1 and wave 2. My thought is to do the following (to keep things simple, I only adjust for the sampling design, and do not yet adjust for nonresponse):

*note: girl is a dummy variable for gender, newgirl is a dummy variable identifying the new supplement of girls added to wave 2, and schoolid is a unique school identifier.

*wave 1 only
gen sampwt1 = {total # of girls enrolled in 6 sampled schools in wave 1}/669 if girl==1
replace sampwt1 = {total # of boys enrolled in all 48 schools in wave 1}/9763 if girl==0
svyset [pweight=sampwt1], psu(schoolid)

*wave 2 only
gen sampwt2 = {total # of girls enrolled in 6 sampled schools in wave 1}/626 if girl==1 & newgirl==0
replace sampwt2 = {total # of girls enrolled in remaining 42 schools in wave 1}/888 if girl==1 & newgirl==1
replace sampwt2 = {total # of boys enrolled in all 48 schools in wave 1}/1273 if girl==0
svyset [pweight=sampwt2], psu(schoolid)

*wave 1 and wave 2; note: the supplemental sample of girls drops out because they have no wave 1 data
gen sampwt3 = {total # of girls enrolled in 6 sampled schools in wave 1}/626 if girl==1 & newgirl==0
replace sampwt3 = 0 if girl==1 & newgirl==1
replace sampwt3 = {total # of boys enrolled in all 48 schools in wave 1}/1273 if girl==0
svyset [pweight=sampwt3], psu(schoolid)
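
To make the weight construction concrete, here is a minimal self-contained sketch of the wave 2 case on a toy data set. The enrollment totals (12000, 1500, 10000) and the seven toy observations are hypothetical placeholders, not figures from the study; only the denominators 626, 888, and 1273 come from the description above. The weight is created once with gen and then filled in for the other groups with replace, since a second gen of the same variable would stop with an error.

```stata
clear
input schoolid girl newgirl
1 0 0
1 1 0
2 0 0
2 1 0
3 0 0
3 1 1
4 1 1
end

* HYPOTHETICAL enrollment totals, for illustration only
scalar N_boys_all48 = 12000
scalar N_girls_6    = 1500
scalar N_girls_42   = 10000

* Create the weight for one group, then fill in the others with replace
gen double sampwt2 = scalar(N_girls_6)/626 if girl==1 & newgirl==0
replace sampwt2 = scalar(N_girls_42)/888   if girl==1 & newgirl==1
replace sampwt2 = scalar(N_boys_all48)/1273 if girl==0

* Every observation should now carry a weight
assert !missing(sampwt2)
tabstat sampwt2, by(girl) statistics(n min max)

* Stata 9 or later syntax; in Stata 8 write: svyset [pweight=sampwt2], psu(schoolid)
svyset schoolid [pweight=sampwt2]
```

The assert line is a cheap guard against a group being left without a weight because of a mistyped if condition.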

My questions are:
(1) Do these svyset specifications make sense?
(2) Should I use gender as a strata identifier, either instead of or in addition to the above?
(3) Should the sampling frame for wave 2 be the school population or the respondents to wave 1?

Thanks for any feedback.

John Reynolds
Sociology, Florida State Univ.
john.reynolds@fsu.edu
