Re: st: appending two survey data sets

From   Stas Kolenikov <>
Subject   Re: st: appending two survey data sets
Date   Thu, 1 Nov 2012 13:26:42 -0500

You are mixing up the stratum and PSU concepts. A stratum is the list
from which you sample. If you had lists of health centers within geographic
blocks of a country, then the block is a stratum, and the health center is a PSU.

Did you sample the same health centers in both years? If you did, you
need to make sure they have not only the same variable name, but also
the same numeric codes for the specific institutions. If you had two
independent samples in two time periods, you need to create a combined
year-by-block stratum variable:

egen int super_strata = group( year block )
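
After that, the design for the appended data can be declared once. A
minimal sketch, assuming the PSU identifier is called health_center, the
weight is called wt, and the outcome is called outcome (all three names
are hypothetical; substitute your own):

```stata
* Assumed names: health_center (PSU), wt (sampling weight), outcome.
* super_strata is the year-by-block stratum created above.
svyset health_center [pweight = wt], strata(super_strata)

* Stata 10 has no factor-variable notation, so build the year dummy by hand.
generate byte y2011 = (year == 2011)

* The coefficient on y2011 estimates the 2009-to-2011 change in outcome.
svy: regress outcome y2011
```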

Steve Samuels keeps filling you with good pointers on other issues.
Listen to him carefully.
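
The preparation that has to happen before that -egen- call can be
sketched as follows, pulling together the advice quoted below. File names
and all variable names other than year and block are hypothetical, and
the weight formula assumes simple random sampling at both stages; a PPS
design would need different selection probabilities.

```stata
* --- 2011 file: harmonize names, build weights, tag the year ---
use survey2011, clear                 // hypothetical file name
rename blk2011 block                  // hypothetical original names
rename hc2011  health_center

* Selection probability = product of the stage-wise sampling fractions
* (assumes SRS at each stage; the four count variables are hypothetical).
generate double p_select = (n_hc_sampled / n_hc_total) * (n_resp_sampled / n_resp_total)
generate double wt = 1 / p_select

generate int year = 2011
tempfile tmp2011
save `tmp2011'

* --- 2009 file: same names, weights already supplied ---
use survey2009, clear                 // hypothetical file name
rename blk2009 block
rename hc2009  health_center
rename wt2009  wt
generate int year = 2009

* --- stack the two years into one data set ---
append using `tmp2011'
```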

-- Stas Kolenikov, PhD, PStat (SSC)  ::
-- Senior Survey Statistician, Abt SRBI  ::  work email kolenikovs at
srbi dot com
-- Opinions stated in this email are mine only, and do not reflect the
position of my employer

On Thu, Nov 1, 2012 at 9:54 AM, Ameya Bondre
<> wrote:
> Thanks Stas..
> Just one question -
> so, both the surveys had similar sampling methods:
> First blocks were sampled and then the individuals were chosen by
> health centers as the strata. This makes "block" as the primary
> sampling unit, "health center" as the stratum and "individuals" as the
> secondary sampling units?  Now, I have variables for all of these in
> both the survey data sets (2009, 2011). But I need to append these
> data sets, in order to run regressions to do a trend analysis across
> the two years. For that, I need to put the svyset command for the
> "appended dataset", before performing any regressions..
> My question is, what variables do I enter in the svyset command -
> svyset psu ( ) [pw = ] strata ( ) ... (now that I have two of each
> kind, from each data set)..?
> Thank you for your time,
> Ameya
> On Wed, Oct 31, 2012 at 3:41 PM, Stas Kolenikov <> wrote:
>> On 1, 2, 3, the short answers are "yes", "yes" and "yes". The longer
>> answers depend on what you have at hand. If you had a simple random
>> sample at each stage, then you simply multiply through the ratios (# of
>> units sampled)/(# of units in the population) to get the probability
>> of selection. A smarter survey statistician would design a PPS survey,
>> in which hospitals would be selected with probabilities proportional
>> to the measure of size (# of beds, # of hospitalized, etc.). You
>> obviously have to make the names of your survey design variables the
>> same in two data sets.
>> A short answer to 4 is to -generate int year=2009- in one data set and
>> -year=2011- in the other before appending. I am not sure as to what's
>> the best way to approach 5, as it really depends on the computing
>> capacity you may have at hand. 800 variables and 10,000 observations
>> would produce a data set of at most 64 MB, and one would really have to
>> go back to the hardware of the late 1990s to have problems with a data
>> set of this size.
>> On Wed, Oct 31, 2012 at 5:13 PM, Ameya Bondre
>> <> wrote:
>>> My name is Ameya Bondre. I am working on two survey data sets for a
>>> sustainability study, and I have a few questions.
>>> The study design:
>>> To give you a background - I have to compare a range of conditions
>>> (health behaviors, diseases and health services) in a region, at the
>>> end of a health program (year 2009 - endline survey), with similar
>>> conditions two years after the program stopped (year 2011 - evaluation
>>> survey, to measure sustainability of program activities). I have two
>>> data sets for the two cross-sectional surveys conducted in 2009 and
>>> 2011. The surveys are independent (as in, the sampling was done again
>>> in 2011). The populations surveyed each time, are different
>>> cross-sections of the same region. Both surveys involve the same
>>> sampling technique with "block" as the stratum, "health center" as the
>>> primary sampling unit and "respondents/mothers" as the secondary
>>> sampling unit (but the variable names for these design variables are
>>> different in the 2009 and 2011 data sets). I am using Stata 10. No FPC
>>> correction has been applied as per the program reports.
>>> Questions (sampling weights and svy command):
>>> 1) I have probability weights already given in the 2009 data sets but
>>> I don't have those built in, for the 2011 data sets. I have been told
>>> that the entire sampling method was similar for both years. Am I
>>> understanding correctly that I first need to calculate weights for all
>>> observations for 2011, then append data sets, and then set up the
>>> combined data set as a "survey set"?
>>> 2) Further, do I need to create the sampling weight variable by
>>> calculating probability weights for 2011 observations (which I already
>>> have for 2009) ? if yes, what's the method to get weights - would I
>>> require the region's population (N) in 2011?
>>> 3) Do I need to create new design variables for the svyset command,
>>> after appending the two data sets? (like one variable for psu, strata,
>>> weight - taking both data sets into account)
>>> Questions (appending data sets)
>>> 4) In appending, I am not able to label the variables/observations for
>>> 2011 separately from 2009, to identify them as "2009" and "2011"
>>> variables  (as appending adds observations and I want to compare
>>> trends across both years), how do I do that?
>>> 5) Since I am using Stata 10 with limited memory and my data sets are
>>> huge (800-odd variables and sample sizes in the thousands), can I append
>>> a few variables at a time (those I need to analyze for certain
>>> regressions), instead of the entire data set - would that affect the
>>> survey design of the new combined data set, after appending?
>>> Please do let me know if any question is not clear. Thanks for your time..
>>> Best,
>>> Ameya Bondre
> --
> Dr. Ameya Bondre
> Research Analyst, Tufts University, Boston, MA
> Master of Science in Public Health (MSPH)
> Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
> MBBS, G.S Medical College and KEM Hospital, Mumbai, India
> Phone: (781) 298-1668