
# Re: st: appending two survey data sets

 From Steve Samuels To statalist@hsphsun2.harvard.edu Subject Re: st: appending two survey data sets Date Thu, 1 Nov 2012 22:00:17 -0400

```
Ameya:

I fear that your unfamiliarity with sampling concepts is going to bite
you. I suggest that you begin by studying some of the references I
previously gave and that you consult someone who can explain to you the
issues for this survey. Best would be one of the statisticians who
designed the survey. Otherwise, look for a survey methodologist at your
university. In order to construct the weights for 2011, you will need to
know the sampling design for centers, whether it is simple random
sampling (SRS), sampling with probability proportional to size (PPS), or
something else. You might also have to understand post-stratification,
and how to do it. You will have to understand why you must use
the subpop() option, instead of -if- expressions, to analyze subgroups.
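
A minimal sketch of the subpop() point, assuming the combined data have
already been -svyset-. Here "young" is a hypothetical 0/1 indicator,
the coding age_gp==1 for the 0-5 month group is an assumption, and
myvar is a placeholder outcome:

. gen byte young = (age_gp == 1) if !missing(age_gp)
. svy, subpop(young): mean myvar, over(year)

With -if-, observations outside the subgroup are dropped before the
variance is computed, so the design information they carry is lost and
the standard errors can be wrong. subpop() keeps the full design.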

1. Stas showed you a statement for super-stratum to be calculated in the
combined data set. You apparently think this should be computed in each
year, but that is incorrect.

. egen int super_stratum = group( year block)
(but see below; you will have to modify some of the values)

2. You need to create one "psu", "finalwt", and "ssu" variable for the
combined data set. If the weight variable has the same name in both data
sets, use that. Otherwise:

. gen finalwt = wt1 if year==2009
. replace finalwt = wt2 if year==2011
(This assumes that you've calculated the 2011 weight.)

The PSU is the center ID. This won't change between surveys.
The SSU is the respondent ID.

3. You need to recode super_stratum if a
block was sampled in both years.

. egen tag= tag(year center)

. sort super_stratum year
. list super_stratum year center if tag, sepby(center)

Some entries will look like

super_stratum year center
5             2009 8
6             2011 8

When the block repeats, recode super_stratum so that there is only one
value for that block:

. recode super_stratum 6 = 5

5             2009 8
5             2011 8

Do a similar recode for every block that was repeated.

4. Suppose your children's age category is age_gp; then a possible
-svyset- statement is:

. svyset center [pw = finalwt] , ///
strata(super_stratum) || respondent_id , strata(age_gp)

But you might need to add the poststrata() and postweight() options.
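
One quick check after -svyset- (a suggestion only, not one of the steps
above) is -svydescribe-, which lists the strata and the number of
sampled units in each. A super-stratum with a single sampled center
would show up here before it causes missing standard errors:

. svydescribe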

Steve

On Nov 1, 2012, at 3:00 PM, Ameya Bondre wrote:

Some health centers were sampled for both years as some blocks do overlap.

and the program reports describe it as stratified two-stage sampling,
here is the description:

"stage 1 - block as the first geographical stratum and area covered by
the health center (or health center) as the primary sampling unit
stage 2 - all eligible respondents  within the P.S.U, would be
secondary sampling units - selected by proportionate random sampling
from the P.S.U. These respondents are randomly selected from two
separate lists of children's age groups (obtained from house-listing
exercises). The respondents are mothers of children 0-5 months and
6-23 months of age." (as such I have 4 data sets in total - for each
age group and for each year, '09 and '11)

So, after appending I would have these variables:
wt1 = pw (for 2009); wt2 = pw (for 2011) and similarly psu1, psu2,
ssu1, ssu2, strata1, strata2, superstrata1, superstrata2

Just wanted to know the stata syntax for the svyset command after
appending the two data sets, for a particular age group?....

thank you for your time :)

Ameya

On Thu, Nov 1, 2012 at 11:01 AM, Steve Samuels <sjsamuels@gmail.com> wrote:
> Ameya, For an SRS design you don't need to get the population N in each
> stratum, just the number of centers in each stratum and the number of
> eligible respondents in each sampled center. The data will contain,
> obviously, the number of selected centers and selected respondents in
> each.
>
> You have a potential bias problem if the design was SRS and population
> "sizes" of the health centers were skewed, e.g. there were relatively few
> "large" centers and more "small" ones. In such a case, respondents from
> smaller centers may be over-represented. The only simple fix is to add
> center "size" to the regression models (see example below).
>
>> appending adds observations and I want to compare
>> trends across both years), how do I do that?
>
> If you wish to compare means or proportions
> (let csize be a grouping of center sizes)
> ***********************
> svy: mean myvar, over(year)
> xi: svy: reg myvar i.year
>
> svy: mean myvar, over(year csize)
> xi: svy: reg myvar i.year i.csize i.year*i.csize
> **********************
>
> For some sampling references, see:
> http://www.stata.com/statalist/archive/2012-09/msg01058.html.
>
>
> Steve
>
> On Oct 31, 2012, at 6:41 PM, Stas Kolenikov wrote:
>
> On 1, 2, 3, the short answers are "yes", "yes" and "yes". The longer
> answers depend on what you have at hand. If you had a simple random
> sample at each stage, then you simply multiply through the ratios (# of
> units sampled)/(# of units in the population) to get the probability
> of selection. A smarter survey statistician would design a PPS survey,
> in which hospitals would be selected with probabilities proportional
> to the measure of size (# of beds, # of hospitalized, etc.). You
> obviously have to make the names of your survey design variables the
> same in two data sets.
>
> A short answer to 4 is to -generate int year=2009- in one data set and
> -year=2011- in the other before appending. I am not sure as to what's
> the best way to approach 5, as it really depends on the computing
> capacity you may have at hand. 800 variables and 10,000 observations
> would produce at most a 64 MB data set, and one would really have to go
> back to the hardware of the late 1990s to have problems with a data set
> of this size.
>
>
>
> --
> -- Stas Kolenikov, PhD, PStat (SSC)  ::  http://stas.kolenikov.name
> -- Senior Survey Statistician, Abt SRBI  ::  work email kolenikovs at
> srbi dot com
> -- Opinions stated in this email are mine only, and do not reflect the
> position of my employer
>
> On Wed, Oct 31, 2012 at 5:13 PM, Ameya Bondre
> <ameyabondre.jhsph@gmail.com> wrote:
>> My name is Ameya Bondre and I am working on two survey data sets for a
>> sustainability study, and had a few questions.
>>
>> The study design:
>>
>> To give you a background - I have to compare a range of conditions
>> (health behaviors, diseases and health services) in a region, at the
>> end of a health program (year 2009 - endline survey), with similar
>> conditions two years after the program stopped (year 2011 - evaluation
>> survey, to measure sustainability of program activities). I have two
>> data sets for the two cross-sectional surveys conducted in 2009 and
>> 2011. The surveys are independent (as in, the sampling was done again
>> in 2011). The populations surveyed each time, are different
>> cross-sections of the same region. Both surveys involve the same
>> sampling technique with "block" as the stratum, "health center" as the
>> primary sampling unit and "respondents/mothers" as the secondary
>> sampling unit (but the variable names for these design variables are
>> different in 2009 and 2011 data sets). I am using STATA 10. No FPC
>> correction has been applied as per the program reports.
>>
>> Questions (sampling weights and svy command):
>>
>> 1) I have probability weights already given in the 2009 data sets but
>> I don't have those built in, for the 2011 data sets. I have been told
>> that the entire sampling method was similar for both years. Am I
>> understanding correctly that I first need to calculate weights for all
>> observations for 2011, then append data sets, and then set up the
>> combined data set as a "survey set"?
>>
>> 2) Further, do I need to create the sampling weight variable by
>> calculating probability weights for 2011 observations (which I already
>> have for 2009) ? if yes, what's the method to get weights - would I
>> require the region's population (N) in 2011?
>>
>> 3) Do I need to create new design variables for the svyset command,
>> after appending the two data sets? (like one variable for psu, strata,
>> weight - taking both data sets into account)
>>
>> Questions (appending data sets)
>>
>> 4) In appending, I am not able to label the variables/observations for
>> 2011 separately from 2009, to identify them as "2009" and "2011"
>> variables  (as appending adds observations and I want to compare
>> trends across both years), how do I do that?
>>
>> 5) Since I am using STATA 10 with limited memory and my data sets are
>> huge (800 odd variables and sample sizes in thousands); can I append
>> few variables at a time (that I need to analyze, for certain
>> regressions), instead of the entire data set - would that affect the
>> survey design of the new combined data set, after appending?
>>
>>
>>
>> Please do let me know if any question is not clear. Thanks for your time..
>>
>> Best,
>> Ameya Bondre
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/

--
Dr. Ameya Bondre
Research Analyst, Tufts University, Boston, MA
Master of Science in Public Health (MSPH)
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
MBBS, G.S Medical College and KEM Hospital, Mumbai, India
Phone: (781) 298-1668
Email: ameyabondre.jhsph@gmail.com
```
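
The workflow discussed in the thread (compute the 2011 weights, add a year variable, give the design variables common names, then append) can be sketched as a do-file. This is a sketch under stated assumptions, not the survey's actual weighting procedure: it assumes simple random sampling at both stages, and every file name and count variable below (`survey2009`, `survey2011`, `N_centers`, `n_centers`, `M_elig`, `m_resp`) is hypothetical. Under that assumption the selection probability is the product of the stage-wise sampling fractions, and the weight is its inverse.

```stata
* Sketch only: all file and variable names here are hypothetical.
* Assumes SRS at both stages and that the 2011 file carries:
*   N_centers = number of centers in the block (stratum)
*   n_centers = number of centers sampled in the block
*   M_elig    = eligible respondents on the center's list for this age group
*   m_resp    = respondents sampled from that list

use survey2011, clear
* Probability of selection = stage-1 fraction times stage-2 fraction
gen double p_select = (n_centers / N_centers) * (m_resp / M_elig)
gen double finalwt  = 1 / p_select
gen int year = 2011
save survey2011_wt, replace

use survey2009, clear
gen int year = 2009
* Needed only if the 2009 weight has a different name (wt1 is assumed)
rename wt1 finalwt
append using survey2011_wt
```

If the centers were in fact drawn with probability proportional to size, or if post-stratification was applied in 2009, the weight line would need to change accordingly, which is why the design has to be confirmed first.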