st: svyset for two wave survey, oversampling, clusters
Date: Wed, 29 Sep 2004 22:32:54 -0400
Dear Statalisters,
I am trying to determine the appropriate use of "svyset" for a somewhat
complex sampling design.
Here is a simplified description of the data set and sampling design. Think
of it as a two-wave survey of youth. In the first wave, boys were
oversampled because they were the primary focus of the research. In
the second wave, a subset of the wave 1 boys and all of the wave 1 girls
were contacted for reinterview, and an additional sample of girls was drawn
from the original sampling frame to balance the sample with regard
to gender. An additional consideration is that the students come from
different subsets of schools in one county. Here is the breakdown
by gender for each wave.
Wave 1 contains 6,760 boys and 626 girls
wave 1 boys: 9,763 randomly selected from all 48 public middle schools in
the county; 6,760 participated
wave 1 girls: 669 randomly selected from six schools selected to be
representative of all 48 public middle schools; 626 participated
Wave 2 contains 956 boys and 927 girls
wave 2 boys: 1,273 of the 6,760 original wave 1 boys were randomly selected
for reinterview; 956 participated in wave 2
wave 2 girls, main group: all of the 626 original wave 1 girls were selected
for reinterview; 410 participated in wave 2
wave 2 girls, supplement: 888 girls were randomly selected from the rosters
of the remaining 42 public middle schools (the schools that provided the
wave 1 sample of girls were excluded); 517 participated in wave 2
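As a quick check on the counts above, the participation rate for each group can be computed directly. This is only a minimal sketch using the selected and participating counts quoted in this message; it says nothing yet about how nonresponse should be handled.

```stata
* participation rate = participated / selected, using the counts quoted above
display "wave 1 boys:              " %5.3f 6760/9763
display "wave 1 girls:             " %5.3f 626/669
display "wave 2 boys:              " %5.3f 956/1273
display "wave 2 girls, main group: " %5.3f 410/626
display "wave 2 girls, supplement: " %5.3f 517/888
```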
Given these specifications, I am trying to determine what svyset should look
like for three different kinds of analyses: wave 1 only, wave 2 only, and a
two-panel study using covariates in both wave 1 and wave 2. My thought is to
do the following (to keep things simple, I only adjust for the sampling
design, and do not yet adjust for nonresponse):
* Note: girl is a dummy variable for gender, newgirl is a dummy variable
* identifying the new supplement of girls added to wave 2, and schoolid is a
* unique school identifier.
* wave 1 only
gen sampwt1 = {total # of girls enrolled in 6 sampled schools in wave 1}/669 if girl==1
replace sampwt1 = {total # of boys enrolled in all 48 schools in wave 1}/9763 if girl==0
svyset [pweight=sampwt1], psu(schoolid)
* wave 2 only
gen sampwt2 = {total # of girls enrolled in 6 sampled schools in wave 1}/626 if girl==1 & newgirl==0
replace sampwt2 = {total # of girls enrolled in remaining 42 schools in wave 1}/888 if girl==1 & newgirl==1
replace sampwt2 = {total # of boys enrolled in all 48 schools in wave 1}/1273 if girl==0
svyset [pweight=sampwt2], psu(schoolid)
* wave 1 and wave 2; note: the supplemental sample of girls drops out,
* because they have no data from wave 1
gen sampwt3 = {total # of girls enrolled in 6 sampled schools in wave 1}/626 if girl==1 & newgirl==0
replace sampwt3 = 0 if girl==1 & newgirl==1
replace sampwt3 = {total # of boys enrolled in all 48 schools in wave 1}/1273 if girl==0
svyset [pweight=sampwt3], psu(schoolid)
My questions are:
(1) Do these svyset specifications make sense?
(2) Should I instead or in addition use gender as a strata identifier?
(3) Should the sampling frames for wave 2 be the school population or the
respondents to wave 1?
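To make question (2) concrete, the alternative I have in mind for the wave 1 analysis would look roughly like the sketch below. It assumes the Stata 8 form of svyset with the strata() and psu() options, and it reuses the sampwt1, girl, and schoolid variables described above; I am not claiming this is the right specification, only showing what "gender as strata" would mean in practice.

```stata
* wave 1 only, with gender declared as the stratum identifier
* (a sketch of the alternative asked about in question 2, not a recommendation)
svyset [pweight=sampwt1], strata(girl) psu(schoolid)

* show the survey settings currently in effect
svyset
```

In later Stata releases, as far as I know, the PSU variable is given first instead, as in "svyset schoolid [pweight=sampwt1], strata(girl)".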
Thanks for any feedback.
John Reynolds
Sociology, Florida State Univ.
john.reynolds@fsu.edu