Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: specifying SVYSET in household survey using multi-stage clustered sampling

From	Steve Samuels <[email protected]>
To	[email protected]
Subject	Re: st: specifying SVYSET in household survey using multi-stage clustered sampling
Date	Sun, 3 Oct 2010 11:06:20 -0400

Hello, Karin.

I think you need to stop calling the gathering strata "regions" and
call them the "gatherings population in each region" or just the
"gathering strata". "Regions" (camps + gatherings) define an analysis
unit.

Create two data sets:
Households: for analysis of HH outcomes and statistics.
Individuals: for analysis of individual outcomes and statistics.

The same -svyset- statements (below) should work for each.

These kinds of designs, which mingle two different sizes of PSUs
(households in the camps and gatherings in the remainder of each
region), are difficult to set up and analyze. The main problem is that
the small number of gatherings sampled in each region gives poor
estimates of variability and of degrees of freedom (df). I'm going to
give you a liberal setup, which will give incorrect degrees of
freedom, and give a reference on the problem at the end.

Strata: create a new variable "my_stratum"
1. Every camp is a stratum

For the refugees living in gatherings-
2. The gatherings in each region constitute a single stratum.

Thus the number of strata will be
H = no. of camps + no. of regions

You will have to create a numbering scheme for strata that includes them both.
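
As a rough sketch of one such numbering scheme, assuming hypothetical
numeric variables "in_camp" (1 = camp, 0 = gathering), "camp_id" and
"region" (none of these names come from the survey itself):

****sketch: build my_stratum (variable names are assumptions)****
* camps are numbered 1..C
egen long my_stratum = group(camp_id) if in_camp == 1
* gathering strata, one per region, are numbered C+1..C+R
egen long region_seq = group(region) if in_camp == 0
summarize my_stratum, meanonly
replace my_stratum = r(max) + region_seq if in_camp == 0
drop region_seq
* check: each stratum should appear in only one column
tabulate my_stratum in_camp
********************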

Define the sampling units and fpcs
In the camp strata, define
psu = building ID
fpc = estimated no. of buildings in the camp
(If you listed individual households, then for "building" above,
substitute "hh".)
ssu2 = hh ID
fpc2 = no. of HH in the building
ssu3 = hh ID
fpc3 = 1.0

In the region strata for gatherings define
psu = (pseudo-)gathering ID
fpc = no. of (pseudo-) gatherings in the region
(alternatively, if gatherings in the region differ greatly in size:
the proportion of the region gathering population in the selected
gatherings, but there is little theory to justify this.)
ssu2 = building ID
fpc2 = no. of buildings in the gathering.
ssu3 = hh ID
fpc3 = no. of HH in a selected building (might be just 1)
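
One way to code these definitions in a single pass is sketched here.
It assumes numeric ID variables "bldg_id", "hh_id" and "gathering_id",
count variables "n_bldg_camp", "n_gath_region", "n_bldg_gath" and
"n_hh_bldg", and an indicator "in_camp" (1 = camp, 0 = gathering).
All of these names are hypothetical; substitute your own.

****sketch: sampling units and fpcs (variable names are assumptions)****
* stage 1: buildings in camps, (pseudo-)gatherings elsewhere
gen long   psu  = cond(in_camp == 1, bldg_id, gathering_id)
gen double fpc  = cond(in_camp == 1, n_bldg_camp, n_gath_region)
* stage 2: households in camps, buildings in gatherings
gen long   ssu2 = cond(in_camp == 1, hh_id, bldg_id)
gen double fpc2 = cond(in_camp == 1, n_hh_bldg, n_bldg_gath)
* stage 3: households everywhere; fpc3 is 1 in the camp strata
gen long   ssu3 = hh_id
gen double fpc3 = cond(in_camp == 1, 1, n_hh_bldg)
********************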

You need two -svyset- statements: one for estimating descriptive
statistics (e.g. means, proportions) and one for regressions and other
tests of association.

****svyset for descriptive stats*****************
* "///" continues a command onto the next line in a do-file
svyset psu [pweight = weight], strata(my_stratum) fpc(fpc) ///
    singleunit(certainty) || ssu2, fpc(fpc2) || ssu3, fpc(fpc3)
********************

The -svyset- for analytic statistics is the same as the previous one
but omits the fpc's

****svyset for regression and tests*****************
* "///" continues a command onto the next line in a do-file
svyset psu [pweight = weight], strata(my_stratum) ///
    singleunit(certainty) || ssu2 || ssu3
********************
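
Once one of the -svyset- statements above has been run, typical calls
might look like the sketch below. "hh_expenditure", "hh_income",
"hh_size" and "region" are hypothetical variable names, used only for
illustration.

****sketch: using the design (variable names are assumptions)****
* list the number of sampled units in each stratum
svydescribe
* descriptive statistics: run after the svyset with fpcs
svy: mean hh_expenditure, over(region)
* regression: run after the svyset without fpcs
svy: regress hh_income hh_size
********************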

The incorrect degrees of freedom will probably not be much of a
problem for country-wide statistics, but could be for region-specific
statistics. See E Korn and B Graubard (1999) Analysis of Health
Surveys, Wiley, NY, Section 5.2 (p 193), for some suggestions.


Best of luck,

Steve

Steven J. Samuels
[email protected]
18 Cantine's Island
Saugerties NY 12477
USA
Voice: 845-246-0774
Fax:    206-202-4783



On Sun, Oct 3, 2010 at 7:43 AM, Karin Seyfert <[email protected]> wrote:
> Dear Steve,
>
> Thank you for taking the time! As for your questions:
>
> 1. That varies across region, generally 50-60% in camps and 40-50% in
> gatherings. This information has been provided by the agency
> responsible for the refugees. I compared them with NGO data where
> available and think they are good guesstimates.
>
> 2. In each region between two and six gatherings were selected.
> a. We select the first gathering with a probability proportionate to
> its population.
> b. If the population of the gathering selected is less than half the
> region's gathering population, I select another gathering, otherwise I
> stop selecting gatherings.
> c. The second gathering is also selected with a probability
> proportionate to its size (the population of the first gathering
> selected has been deducted from the gathering population of the entire
> region)
> d. If the cumulative population in the two selected regions is less
> than half the country's total population, I select another region as
> described above, otherwise I stop selecting regions.
>
> 3. We sampled buildings from satellite images. The questionnaire
> contains information on how many HH live in each building sampled.
> More than one questionnaire could be administered per building.
>
> 4. The weights are a separate issue. I am working with someone from
> the maths department here and did not want to clutter this email or
> the list with non-stata related problems. I will carry out the checks
> you recommended.
>
> Karin
>
> On Sat, Oct 2, 2010 at 10:24 PM, Steve Samuels <[email protected]> wrote:
>> Thanks Karin
>>
>> Some more questions and I think I can provide a workable -svyset- command
>>
>> 1. What proportions of the population (HH?) are inside and outside
>> camps? How did you know this?
>> 2. How many gatherings did you select for the sample?
>> 3. What was the sampling process for HH in the camps and in the
>> sampled gathering? I'm guessing that you listed all of them first.
>>
>> Not needed to do -svyset-, but important:
>>
>> Have you checked to see if the sum of the HH weights in the sample is
>> close to the known number of HH for the sample and that this is true
>> separately inside and outside the camps and for each region?
>>
>> Steve
>>

On Fri, Oct 1, 2010 at 11:33 AM, Karin Seyfert <[email protected]> wrote:
> --
> Dear Steve,
>
> Thank you so much for your quick reply. I am sorry if I was confusing,
> but you have re-formulated the survey design correctly and much more
> clearly.
>
> As for your questions:
>
> We did not study refugees living in neither camps nor gatherings. It
> is assumed refugees live only in camps or gatherings.
>
> We collected individual information about each household member (age,
> education, employment etc.) but also aggregate information (household
> expenditure, household assets etc.).
>
> We hope to estimate descriptive proportions as well as carry out some
> analysis (i.e. what affects household income, or at the individual
> level, what 'predicts' health status)
>
> Best
> Karin
>
> On Fri, Oct 1, 2010 at 5:19 PM, Steve Samuels <[email protected]> wrote:
>> Karin,
>>
>> I found your description confusing. I want to reconstruct the survey
>> design in terms that I can understand, so I'll start with the basics.
>> Here's what I think you have done.  Please correct me if I
>> misunderstand.
>>
>> 1) Your survey area is divided into regions
>>
>> 2) Every region had at least one camp.  You selected all camps into
>> the study and took a sample of HH from each.
>>
>> 3) In all regions, refugees could also live in "gatherings" outside
>> camps.   You selected a _sample_ of these gatherings in each region.
>> Within each selected gathering, you took a sample of HH.
>>
>> Question: did you also study refugees who lived neither in camps nor in gatherings?
>>
>> Question: within HH, did you obtain aggregate information, or
>> information about each member?
>>
>> You have stated that one purpose of the study is to obtain estimates for
>> each region. Are these primarily estimates of descriptive statistics
>> (e.g. proportions)?
>>
>> Steve
>>
>> Steven J. Samuels
>> [email protected]
>> 18 Cantine's Island
>> Saugerties NY 12477
>> USA
>> Voice: 845-246-0774
>> Fax:    206-202-4783
>>
>> On Fri, Oct 1, 2010 at 2:22 AM, Karin Seyfert <[email protected]> wrote:
>>> Dear stata List,
>>>
>>> we have run a large household survey among refugees.
>>>
>>> Refugees live in clusters of camps or outside camp gatherings within
>>> several regions.
>>>
>>> We stratified our sample by 'camp' vs. 'outside camp gatherings' (1)
>>> and region (2).
>>> In strata (1) we under- and oversampled households to obtain robust
>>> regional estimates.
>>> Within strata (2), the camp/outside camp strata, we sampled households
>>> proportional to the share of households living inside or outside
>>> camps.
>>>
>>> We selected clusters within these two strata as follows:
>>> a) We selected all camps in all regions and
>>> b) a certain number of gatherings in all regions. Gatherings were
>>> selected with probabilities proportionate to their population within
>>> each region. They were sampled without replacement.
>>>
>>> Within the selected clusters, we used simple random sampling to select
>>> refugee households.  Within each cluster we sampled about 5-10% of the
>>> population. Since we are unsure about exact camp/gathering populations
>>> and we sample a small share, we assume sampling with replacement.
>>>
>>> I do have sampling weights (inverse probability of a HH being
>>> selected) and have adjusted for over- and under-sampling within the
>>> regional strata (variable called 'weights'). Some strata contain a
>>> singleton SU (one region has only one camp), which we treat as
>>> certainty units.
>>>
>>> I am unsure how to specify -svyset-. Below is what I think the response
>>> to -svydes- should look like. Does it look correct?  I would be
>>> grateful for help with the question marks below. I am also unsure what
>>> to specify as PSU, households or  clusters?
>>>
>>>     pweight: weights
>>>         VCE: linearized
>>> Single unit: certainty
>>>    Strata 1: camp/gathering
>>>        SU 1: ?
>>>       FPC 1: ?
>>>    Strata 2: regions
>>>        SU 2: households
>>>       FPC 2: number of households per region
>>>
>>>
>>> I am sorry to take your time. I would really appreciate your help!
>>> Please also correct any mistakes or inconsistencies in my reasoning.
>>>
>>> Many Thanks
>>> Karin Seyfert
>>> PhD Candidate
>>> School of Oriental and African Studies
>>> University of London
>>>
>
