Re: st: specifying SVYSET in household survey using multi-stage clustered sampling


From   Steve Samuels <sjsamuels@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: specifying SVYSET in household survey using multi-stage clustered sampling
Date   Sun, 3 Oct 2010 16:36:24 -0400

Actually the references were to "pseudo-gatherings".

S.

On Sun, Oct 3, 2010 at 4:35 PM, Steve Samuels <sjsamuels@gmail.com> wrote:
> Sorry. I sent an early draft of my reply and there are some remnants
> of sections that I later deleted (the references to "pseudo-strata").
> Please use the following:
>
>
>
> Strata: create a new variable "my_stratum"
> 1. Every camp is a stratum
>
> For the refugees living in gatherings-
> 2. The gatherings in each region constitute a single stratum.
>
> Thus the number of strata will be
> H = no. of camps + no. of regions
>
> You will have to create a numbering scheme for strata that includes them both.
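>
> A minimal sketch of one such numbering scheme, assuming hypothetical
> variables in_camp (1 = camp, 0 = gathering), camp_id, and region.
> None of these names come from the survey data; substitute your own.
>
> ****one possible numbering scheme for strata*****************
> * camps get 1..C, one stratum per camp
> egen camp_grp = group(camp_id) if in_camp == 1
> * gathering strata get 1..R, one per region
> egen region_grp = group(region) if in_camp == 0
> * r(max) is the number of camps, so gathering strata follow the camps
> quietly summarize camp_grp
> generate my_stratum = cond(in_camp == 1, camp_grp, r(max) + region_grp)
> ********************
>
> A quick -tabulate my_stratum in_camp- afterwards should show every
> stratum falling entirely in one column.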
>
> Define the sampling units and fpcs
> In the camp strata, define
> psu = building ID
> fpc = estimated no. of buildings in the camp
> (If you listed individual households, then for "building" above,
> substitute "hh".)
> ssu2= hh ID
> fpc2 = no. of HH in the building
> ssu3 = hh ID
> fpc3 = 1.0
>
> In the region strata for gatherings define
> psu = gathering ID
> fpc = no. of  gatherings in the region
> ssu2 = building ID
> fpc2 = no. of buildings in the gathering.
> ssu3 = hh ID
> fpc3 = no. of HH in a selected building (might be just 1)
>
> You need two -svyset- statements, one for estimating descriptive
> statistics (e.g. means, proportions), one for regressions and other
> tests of association.
>
> ****svyset for descriptive stats*****************
> * later stages follow ||, each written as: ssu_variable, fpc(...)
> svyset psu [pweight = weight], strata(my_stratum) fpc(fpc) ///
>     singleunit(certainty) || ssu2, fpc(fpc2) || ssu3, fpc(fpc3)
> ********************
>
> The -svyset- for analytic statistics is the same as the previous one
> but omits the fpc's
>
> ****svyset for regression and tests*****************
> svyset psu [pweight = weight], strata(my_stratum) ///
>     singleunit(certainty) || ssu2 || ssu3
> ********************
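>
> A sketch of how each declaration might then be used, assuming
> hypothetical variables hh_income, hh_size, employed, and region.
> These names are illustrative only. Estimation goes through the -svy-
> prefix, which uses whichever -svyset- was issued most recently.
>
> ****after the svyset for descriptive stats*****************
> svy: mean hh_income, over(region)
> svy: proportion employed
> * region-specific estimate that keeps the full design for variances
> svy, subpop(if region == 1): mean hh_income
> ********************
>
> ****after the svyset for regression and tests*****************
> svy: regress hh_income hh_size
> ********************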
>
> The incorrect degrees of freedom will probably not be much of a
> problem for country-wide statistics, but could be for region-specific
> statistics. See E Korn and B Graubard (1999) Analysis of Health
> Surveys, Wiley, NY, Section 5.2 (p 193), for some suggestions.
>
>
> On Sun, Oct 3, 2010 at 11:06 AM, Steve Samuels <sjsamuels@gmail.com> wrote:
>> Hello, Karin.
>>
>> I think you need to stop calling the gathering strata "regions", and
>> call them the "gatherings population in each region" or just the
>> "gathering strata". "Regions" (camps + gatherings) define an analysis
>> unit.
>>
>> Create two data sets:
>> Households: for analysis of hh outcomes and statistics.
>> Individuals: for analysis of individual outcomes & statistics.
>>
>> The same -svyset- statements (below) should work for each.
>>
>> These kinds of designs, which mingle two different sizes of PSUs,
>> households in the camps and gatherings in the remainder of each
>> region, are difficult to set up and analyze. The main problem is that
>> the small number of gatherings sampled in each region gives poor
>> estimates of variability and of degrees of freedom (df). I'm going to
>> give you a liberal setup, which will give incorrect degrees of
>> freedom, and give a reference to the problem at the end.
>>
>> Strata: create a new variable "my_stratum"
>> 1. Every camp is a stratum
>>
>> For the refugees living in gatherings-
>> 2. The gatherings in each region constitute a single stratum.
>>
>> Thus the number of strata will be
>> H = no. of camps + no. of regions
>>
>> You will have to create a numbering scheme for strata that includes them both.
>>
>> Define the sampling units and fpcs
>> In the camp strata, define
>> psu = building ID
>> fpc = estimated no. of buildings in the camp
>> (If you listed individual households, then for "building" above,
>> substitute "hh".)
>> ssu2= hh ID
>> fpc2 = no. of HH in the building
>> ssu3 = hh ID
>> fpc3 = 1.0
>>
>> In the region strata for gatherings define
>> psu = gathering ID
>> fpc = no. of  gatherings in the region
>> (alternatively, if gatherings in the region differ greatly in size:
>> the proportion of the region gathering population in the selected
>> gatherings, but there is little theory to justify this.)
>> ssu2 = building ID
>> fpc2 = no. of buildings in the gathering.
>> ssu3 = hh ID
>> fpc3 = no. of HH in a selected building (might be just 1)
>>
>> You need two -svyset- statements, one for estimating descriptive
>> statistics (e.g. means, proportions), one for regressions and other
>> tests of association.
>>
>> ****svyset for descriptive stats*****************
>> * later stages follow ||, each written as: ssu_variable, fpc(...)
>> svyset psu [pweight = weight], strata(my_stratum) fpc(fpc) ///
>>     singleunit(certainty) || ssu2, fpc(fpc2) || ssu3, fpc(fpc3)
>> ********************
>>
>> The -svyset- for analytic statistics is the same as the previous one
>> but omits the fpc's
>>
>> ****svyset for regression and tests*****************
>> svyset psu [pweight = weight], strata(my_stratum) ///
>>     singleunit(certainty) || ssu2 || ssu3
>> ********************
>>
>> The incorrect degrees of freedom will probably not be much of a
>> problem for country-wide statistics, but could be for region-specific
>> statistics. See E Korn and B Graubard (1999) Analysis of Health
>> Surveys, Wiley, NY, Section 5.2 (p 193), for some suggestions.
>>
>>
>> Best of luck,
>>
>> Steve
>>
>> Steven J. Samuels
>> sjsamuels@gmail.com
>> 18 Cantine's Island
>> Saugerties NY 12477
>> USA
>> Voice: 845-246-0774
>> Fax:    206-202-4783
>>
>>
>>
>> On Sun, Oct 3, 2010 at 7:43 AM, Karin Seyfert <karin.seyfert@gmail.com> wrote:
>>> Dear Steve,
>>>
>>> Thank you for taking the time! As for your questions:
>>>
>>> 1. That varies across region, generally 50-60% in camps and 40-50% in
>>> gatherings. This information has been provided by the agency
>>> responsible for the refugees. I compared them with NGO data where
>>> available and think they are good guesstimates.
>>>
>>> 2. In each region between two and six gatherings were selected.
>>> a. We select the first gathering with a probability proportionate to
>>> its population.
>>> b. If the population of the gathering selected is less than half the
>>> region's gathering population, I select another gathering, otherwise I
>>> stop selecting gatherings.
>>> c. The second gathering is also selected with a probability
>>> proportionate to its size (the population of the first gathering
>>> selected has been deducted from the gathering population of the entire
>>> region).
>>> d. If the cumulative population in the two selected regions is less
>>> than half the country's total population, I select another region as
>>> described above, otherwise I stop selecting regions.
>>>
>>> 3. We sampled buildings from satellite images. The questionnaire
>>> contains information on how many HH live in each building sampled.
>>> More than one questionnaire could be administered per building.
>>>
>>> 4. The weights are a separate issue. I am working with someone from
>>> the maths department here and did not want to clutter this email or
>>> the list with non-Stata-related problems. I will carry out the checks
>>> you recommended.
>>>
>>> Karin
>>>
>>> On Sat, Oct 2, 2010 at 10:24 PM, Steve Samuels <sjsamuels@gmail.com> wrote:
>>>> Thanks Karin
>>>>
>>>> Some more questions and I think I can provide a workable -svyset- command
>>>>
>>>> 1. What proportions of the population (HH?) are inside and outside
>>>> camps? How did you know this?
>>>> 2. How many gatherings did you select for the sample?
>>>> 3. What was the sampling process for HH in the camps and in the
>>>> sampled gatherings? I'm guessing that you listed all of them first.
>>>>
>>>> Not needed to do -svyset-, but important:
>>>>
>>>> Have you checked to see if the sum of the HH weights in the sample is
>>>> close to the known number of HH for the sample and that this is true
>>>> separately inside and outside the camps and for each region?
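>>>>
>>>> A sketch of such a check on the household data set, assuming the
>>>> weight variable is named weight and using hypothetical variables
>>>> in_camp (1 = camp, 0 = gathering) and region:
>>>>
>>>> ****sum of HH weights, overall and by subgroup*****************
>>>> * sum = estimated no. of HH; n = no. of sampled HH
>>>> tabstat weight, statistics(sum n)
>>>> tabstat weight, by(in_camp) statistics(sum n)
>>>> tabstat weight, by(region) statistics(sum n)
>>>> ********************
>>>>
>>>> Each sum can then be compared with the corresponding known or
>>>> estimated number of HH.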
>>>>
>>>> Steve
>>>>
>>
>> On Fri, Oct 1, 2010 at 11:33 AM, Karin Seyfert <karin.seyfert@gmail.com> wrote:
>>> --
>>> Dear Steve,
>>>
>>> Thank you so much for your quick reply. I am sorry if I was confusing,
>>> but you have re-formulated the survey design correctly and much more
>>> clearly.
>>>
>>> As for your questions:
>>>
>>> We did not study refugees living in neither camps nor gatherings. It
>>> is assumed refugees live only in camps or gatherings.
>>>
>>> We collected individual information about each household member (age,
>>> education, employment etc.) but also aggregate information (household
>>> expenditure, household assets etc.).
>>>
>>> We hope to estimate descriptive proportions as well as carry out some
>>> analysis (i.e. what affects household income, or at the individual
>>> level, what 'predicts' health status)
>>>
>>> Best
>>> Karin
>>>
>>> On Fri, Oct 1, 2010 at 5:19 PM, Steve Samuels <sjsamuels@gmail.com> wrote:
>>>> Karin,
>>>>
>>>> I found your description confusing. I want to reconstruct the survey
>>>> design in terms that I can understand, so I'll start with the basics.
>>>> Here's what I think you have done.  Please correct me if I
>>>> misunderstand.
>>>>
>>>> 1) Your survey area is divided into regions
>>>>
>>>> 2) Every region had at least one camp.  You selected all camps into
>>>> the study and took a sample of HH from each.
>>>>
>>>> 3) In all regions, refugees could also live in "gatherings" outside
>>>> camps.   You selected a _sample_ of these gatherings in each region.
>>>> Within each selected gathering, you took a sample of HH.
>>>>
>>>> Question: did you also study refugees who lived neither in camps nor in gatherings?
>>>>
>>>> Question: within HH, did you obtain aggregate information, or
>>>> information about each member?
>>>>
>>>> You have stated that one purpose of the study is to obtain estimates for
>>>> each region. Are these primarily estimates of descriptive statistics
>>>> (e.g. proportions?)
>>>>
>>>> Steve
>>>>
>>>> Steven J. Samuels
>>>> sjsamuels@gmail.com
>>>> 18 Cantine's Island
>>>> Saugerties NY 12477
>>>> USA
>>>> Voice: 845-246-0774
>>>> Fax:    206-202-4783
>>>>
>>>> On Fri, Oct 1, 2010 at 2:22 AM, Karin Seyfert <karin.seyfert@gmail.com> wrote:
>>>>> Dear stata List,
>>>>>
>>>>> We have run a large household survey among refugees.
>>>>>
>>>>> Refugees live in clusters of camps or outside camp gatherings within
>>>>> several regions.
>>>>>
>>>>> We stratified our sample by 'camp' vs. 'outside camp gatherings' (1)
>>>>> and region (2).
>>>>> In strata (1) we under- and oversampled households to obtain robust
>>>>> regional estimates.
>>>>> Within strata (2), the camp/outside camp strata, we sampled households
>>>>> proportional to the share of households living inside or outside
>>>>> camps.
>>>>>
>>>>> We selected clusters within these two strata as follows:
>>>>> a) We selected all camps in all regions and
>>>>> b) a certain number of gatherings in all regions. Gatherings were
>>>>> selected with probabilities proportionate to their population within
>>>>> each region. They were sampled without replacement.
>>>>>
>>>>> Within the selected clusters, we used simple random sampling to select
>>>>> refugee households.  Within each cluster we sampled about 5-10% of the
>>>>> population. Since we are unsure about exact camp/gathering populations
>>>>> and we sample a small share, we assume sampling with replacement.
>>>>>
>>>>> I do have sampling weights (inverse probability of a HH being
>>>>> selected) and have adjusted for over- and under-sampling within the
>>>>> regional strata (variable called 'weights'). Some strata contain a
>>>>> singleton SU (one region has only one camp), which we treat as
>>>>> certainty units.
>>>>>
>>>>> I am unsure how to specify -svyset-. Below is what I think the response
>>>>> to -svydes- should look like. Does it look correct?  I would be
>>>>> grateful for help with the question marks below. I am also unsure what
>>>>> to specify as PSU, households or  clusters?
>>>>>
>>>>>      pweight: weights
>>>>>          VCE: linearized
>>>>>  Single unit: certainty
>>>>>     Strata 1: camp/gathering
>>>>>         SU 1: ?
>>>>>        FPC 1: ?
>>>>>     Strata 2: regions
>>>>>         SU 2: households
>>>>>        FPC 2: number of households per region
>>>>>
>>>>>
>>>>> I am sorry to take your time. I would really appreciate your help!
>>>>> Please also correct any mistakes or inconsistencies in my reasoning.
>>>>>
>>>>> Many Thanks
>>>>> Karin Seyfert
>>>>> PhD Candidate
>>>>> School of Oriental and African Studies
>>>>> University of London
>>>>>
>>>
>>
>
