Re: st: specifying SVYSET in household survey using multi-stage clustered sampling

From	Steve Samuels <[email protected]>
To	[email protected]
Subject	Re: st: specifying SVYSET in household survey using multi-stage clustered sampling
Date	Sun, 3 Oct 2010 16:35:25 -0400

Sorry. I sent an early draft of my reply and there are some remnants
of sections that I later deleted (the references to "pseudo-strata").
Please use the following:



Strata: create a new variable "my_stratum"
1. Every camp is a stratum

For the refugees living in gatherings-
2. The gatherings in each region constitute a single stratum.

Thus the number of strata will be
H = no. of camps + no. of regions

You will have to create a numbering scheme for strata that includes them both.
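
One possible numbering scheme, as a rough sketch only. The variable names in_camp (1 = camp, 0 = gathering), camp_id and region are my assumptions, not names from your data; substitute your own.

```stata
* Sketch only: in_camp, camp_id and region are hypothetical variable names.
* Camps: one stratum per camp, numbered 1, 2, ..., no. of camps
egen camp_stratum = group(camp_id) if in_camp == 1

* Store the number of camp strata
summarize camp_stratum, meanonly
local n_camps = r(max)

* Gatherings: one stratum per region, numbered after the camp strata
egen gath_stratum = group(region) if in_camp == 0

generate my_stratum = camp_stratum if in_camp == 1
replace  my_stratum = `n_camps' + gath_stratum if in_camp == 0

* Every observation should now belong to exactly one stratum
assert !missing(my_stratum)
tabulate my_stratum in_camp
```

The number of distinct values of my_stratum in the table should equal H.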

Define the sampling units and fpcs
In the camp strata, define
psu = building ID
fpc = estimated no. of buildings in the camp
(If you listed individual households, then for "building" above,
substitute "hh".)
ssu2 = hh ID
fpc2 = no. of HH in the building
ssu3 = hh ID
fpc3 = 1.0

In the region strata for gatherings define
psu = gathering ID
fpc = no. of gatherings in the region
ssu2 = building ID
fpc2 = no. of buildings in the gathering.
ssu3 = hh ID
fpc3 = no. of HH in a selected building (might be just 1)
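
The definitions above could be coded along these lines. This is a sketch under assumptions: every variable on the right-hand side (in_camp, building_id, hh_id, gathering_id, n_bldg_camp, n_hh_bldg, n_gath_region, n_bldg_gath) is a hypothetical name standing in for whatever your data set actually contains.

```stata
* Sketch only: all right-hand-side variable names are hypothetical.
* double storage avoids rounding of long numeric IDs.

* Camp strata
generate double psu  = building_id if in_camp == 1
generate double fpc  = n_bldg_camp if in_camp == 1
generate double ssu2 = hh_id       if in_camp == 1
generate double fpc2 = n_hh_bldg   if in_camp == 1
generate double ssu3 = hh_id       if in_camp == 1
generate double fpc3 = 1           if in_camp == 1

* Region (gathering) strata
replace psu  = gathering_id  if in_camp == 0
replace fpc  = n_gath_region if in_camp == 0
replace ssu2 = building_id   if in_camp == 0
replace fpc2 = n_bldg_gath   if in_camp == 0
replace ssu3 = hh_id         if in_camp == 0
replace fpc3 = n_hh_bldg     if in_camp == 0

* No observation should be left without design information
assert !missing(psu, fpc, ssu2, fpc2, ssu3, fpc3)
```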

You need two -svyset- statements: one for estimating descriptive
statistics (e.g. means, proportions), and one for regressions and other
tests of association.

****svyset for descriptive stats*****************
svyset psu [pweight = weight], strata(my_stratum) fpc(fpc) ///
    singleunit(certainty) || ssu2, fpc(fpc2) || ssu3, fpc(fpc3)
********************

The -svyset- for analytic statistics is the same as the previous one
but omits the fpc's.

****svyset for regression and tests*****************
svyset psu [pweight = weight], strata(my_stratum) ///
    singleunit(certainty) || ssu2 || ssu3
********************
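
For illustration, estimation commands would then be run with the svy prefix after whichever -svyset- is in force. The outcome and covariate names below (hh_expenditure, employed, hh_income, hh_size, region) are hypothetical placeholders, not variables I know to be in your data.

```stata
* Sketch only: all outcome and covariate names are hypothetical.

* Run after the -svyset- for descriptive stats
svydescribe
svy: mean hh_expenditure
svy: proportion employed, over(region)

* Run after re-issuing the -svyset- for regression and tests
svy: regress hh_income hh_size i.region
```

-svydescribe- is worth running first: it lists the number of sampling units per stratum, so strata with a single PSU are easy to spot.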

The incorrect degrees of freedom will probably not be much of a
problem for country-wide statistics, but could be for region-specific
statistics. See E Korn and B Graubard (1999) Analysis of Health
Surveys, Wiley, NY, Section 5.2 (p 193), for some suggestions.


On Sun, Oct 3, 2010 at 11:06 AM, Steve Samuels <[email protected]> wrote:
> Hello, Karin.
>
> I think you need to stop calling the gathering strata "regions", and
> call them the "gatherings population in each region" or just the
> "gathering strata". "Regions" (camps + gatherings) define an analysis
> unit.
>
> Create two data sets:
> Households: for analysis of hh outcomes and statistics.
> Individuals: for analysis of individual outcomes & statistics.
>
> The same -svyset- statements (below) should work for each.
>
> These kinds of designs, which mingle two different sizes of PSUs,
> households in the camps and gatherings in the remainder of each
> region, are difficult to set up and analyze. The main problem is that
> the small number of gatherings sampled in each region gives poor
> estimates of variability and of degrees of freedom (df). I'm going to
> give you a liberal setup, which will give incorrect degrees of
> freedom, and give a reference on the problem at the end.
>
> Strata: create a new variable "my_stratum"
> 1. Every camp is a stratum
>
> For the refugees living in gatherings-
> 2. The gatherings in each region constitute a single stratum.
>
> Thus the number of strata will be
> H = no. of camps + no. of regions
>
> You will have to create a numbering scheme for strata that includes them both.
>
> Define the sampling units and fpcs
> In the camp strata, define
> psu = building ID
> fpc = estimated no. of buildings in the camp
> (If you listed individual households, then for "building" above,
> substitute "hh".)
> ssu2 = hh ID
> fpc2 = no. of HH in the building
> ssu3 = hh ID
> fpc3 = 1.0
>
> In the region strata for gatherings define
> psu = gathering ID
> fpc = no. of gatherings in the region
> (alternatively, if gatherings in the region differ greatly in size:
> the proportion of the region gathering population in the selected
> gatherings, but there is little theory to justify this.)
> ssu2 = building ID
> fpc2 = no. of buildings in the gathering.
> ssu3 = hh ID
> fpc3 = no. of HH in a selected building (might be just 1)
>
> You need two -svyset- statements: one for estimating descriptive
> statistics (e.g. means, proportions), and one for regressions and other
> tests of association.
>
> ****svyset for descriptive stats*****************
> svyset psu [pweight = weight], strata(my_stratum) fpc(fpc) ///
>     singleunit(certainty) || ssu2, fpc(fpc2) || ssu3, fpc(fpc3)
> ********************
>
> The -svyset- for analytic statistics is the same as the previous one
> but omits the fpc's.
>
> ****svyset for regression and tests*****************
> svyset psu [pweight = weight], strata(my_stratum) ///
>     singleunit(certainty) || ssu2 || ssu3
> ********************
>
> The incorrect degrees of freedom will probably not be much of a
> problem for country-wide statistics, but could be for region-specific
> statistics. See E Korn and B Graubard (1999) Analysis of Health
> Surveys, Wiley, NY, Section 5.2 (p 193), for some suggestions.
>
>
> Best of luck,
>
> Steve
>
> Steven J. Samuels
> [email protected]
> 18 Cantine's Island
> Saugerties NY 12477
> USA
> Voice: 845-246-0774
> Fax:    206-202-4783
>
>
>
> On Sun, Oct 3, 2010 at 7:43 AM, Karin Seyfert <[email protected]> wrote:
>> Dear Steve,
>>
>> Thank you for taking the time! As for your questions:
>>
>> 1. That varies across region, generally 50-60% in camps and 40-50% in
>> gatherings. This information has been provided by the agency
>> responsible for the refugees. I compared them with NGO data where
>> available and think they are good guesstimates.
>>
>> 2. In each region between two and six gatherings were selected.
>> a. We select the first gathering with a probability proportionate to
>> its population.
>> b. If the population of the gathering selected is less than half the
>> region's gathering population, I select another gathering, otherwise I
>> stop selecting gatherings.
>> c. The second gathering is also selected with a probability
>> proportionate to its size (the population of the first gathering
>> selected has been deducted from the gathering population of the entire
>> region)
>> d. If the cumulative population in the two selected regions is less
>> than half the country's total population, I select another region as
>> described above, otherwise I stop selecting regions.
>>
>> 3. We sampled buildings from satellite images. The questionnaire
>> contains information on how many HH live in each building sampled.
>> More than one questionnaire could be administered per building.
>>
>> 4. The weights are a separate issue. I am working with someone from
>> the maths department here and did not want to clutter this email or
>> the list with non-stata related problems. I will carry out the checks
>> you recommended.
>>
>> Karin
>>
>> On Sat, Oct 2, 2010 at 10:24 PM, Steve Samuels <[email protected]> wrote:
>>> Thanks Karin
>>>
>>> Some more questions and I think I can provide a workable -svyset- command
>>>
>>> 1. What proportions of the population (HH?) are inside and outside
>>> camps? How did you know this?
>>> 2. How many gatherings did you select for the sample?
>>> 3. What was the sampling process for HH in the camps and in the
>>> sampled gatherings? I'm guessing that you listed all of them first.
>>>
>>> Not needed to do -svyset-, but important:
>>>
>>> Have you checked to see if the sum of the HH weights in the sample is
>>> close to the known number of HH for the sample and that this is true
>>> separately inside and outside the camps and for each region?
>>>
>>> Steve
>>>
>
> On Fri, Oct 1, 2010 at 11:33 AM, Karin Seyfert <[email protected]> wrote:
>> --
>> Dear Steve,
>>
>> Thank you so much for your quick reply. I am sorry if I was confusing,
>> but you have re-formulated the survey design correctly and much more
>> clearly.
>>
>> As for your questions:
>>
>> We did not study refugees living in neither camps nor gatherings. It
>> is assumed refugees live only in camps or gatherings.
>>
>> We collected individual information about each household member (age,
>> education, employment etc.) but also aggregate information (household
>> expenditure, household assets etc.).
>>
>> We hope to estimate descriptive proportions as well as carry out some
>> analysis (i.e. what affects household income, or at the individual
>> level, what 'predicts' health status)
>>
>> Best
>> Karin
>>
>> On Fri, Oct 1, 2010 at 5:19 PM, Steve Samuels <[email protected]> wrote:
>>> Karin,
>>>
>>> I found your description confusing. I want to reconstruct the survey
>>> design in terms that I can understand, so I'll start with the basics.
>>> Here's what I think you have done.  Please correct me if I
>>> misunderstand.
>>>
>>> 1) Your survey area is divided into regions
>>>
>>> 2) Every region had at least one camp.  You selected all camps into
>>> the study and took a sample of HH from each.
>>>
>>> 3) In all regions, refugees could also live in "gatherings" outside
>>> camps.   You selected a _sample_ of these gatherings in each region.
>>> Within each selected gathering, you took a sample of HH.
>>>
>>> Question: did you also study refugees who lived neither in camps nor in gatherings?
>>>
>>> Question: within HH, did you obtain aggregate information, or
>>> information about each member?
>>>
>>> You have stated that one purpose of the study is to obtain estimates for
>>> each region. Are these primarily estimates of descriptive statistics
>>> (e.g. proportions)?
>>>
>>> Steve
>>>
>>> Steven J. Samuels
>>> [email protected]
>>> 18 Cantine's Island
>>> Saugerties NY 12477
>>> USA
>>> Voice: 845-246-0774
>>> Fax:    206-202-4783
>>>
>>> On Fri, Oct 1, 2010 at 2:22 AM, Karin Seyfert <[email protected]> wrote:
>>>> Dear stata List,
>>>>
>>>> we have run a large household survey among refugees.
>>>>
>>>> Refugees live in clusters of camps or outside camp gatherings within
>>>> several regions.
>>>>
>>>> We stratified our sample by 'camp' vs. 'outside camp gatherings' (1)
>>>> and region (2).
>>>> In strata (1) we under- and oversampled households to obtain robust
>>>> regional estimates.
>>>> Within strata (2), the camp/outside camp strata, we sampled households
>>>> proportional to the share of households living inside or outside
>>>> camps.
>>>>
>>>> We selected clusters within these two strata as follows:
>>>> a) We selected all camps in all regions and
>>>> b) a certain number of gatherings in all regions. Gatherings were
>>>> selected with probabilities proportionate to their population within
>>>> each region. They were sampled without replacement.
>>>>
>>>> Within the selected clusters, we used simple random sampling to select
>>>> refugee households.  Within each cluster we sampled about 5-10% of the
>>>> population. Since we are unsure about exact camp/gathering populations
>>>> and we sample a small share, we assume sampling with replacement.
>>>>
>>>> I do have sampling weights (inverse probability of a HH being
>>>> selected) and have adjusted for over- and under-sampling within the
>>>> regional strata (variable called 'weights'). Some strata contain a
>>>> singleton SU (one region has only one camp), which we treat as
>>>> certainty units.
>>>>
>>>> I am unsure how to specify -svyset-. Below is what I think the output
>>>> of -svydes- should look like. Does it look correct?  I would be
>>>> grateful for help with the question marks below. I am also unsure what
>>>> to specify as PSU: households or clusters?
>>>>
>>>>     pweight: weights
>>>>         VCE: linearized
>>>> Single unit: certainty
>>>>    Strata 1: camp/gathering
>>>>        SU 1: ?
>>>>       FPC 1: ?
>>>>    Strata 2: regions
>>>>        SU 2: households
>>>>       FPC 2: number of households per region
>>>>
>>>>
>>>> I am sorry to take your time. I would really appreciate your help!
>>>> Please also correct any mistakes or inconsistencies in my reasoning.
>>>>
>>>> Many Thanks
>>>> Karin Seyfert
>>>> PhD Candidate
>>>> School of Oriental and African Studies
>>>> University of London
>>>>
>>
>
