
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



AW: st: Pooled Datasets (DHS) - use of syvset & regional controls


From   Jörg Kattner <[email protected]>
To   "[email protected]" <[email protected]>
Subject   AW: st: Pooled Datasets (DHS) - use of syvset & regional controls
Date   Tue, 7 Jan 2014 13:25:29 +0000

Thanks for your response.

If I understood you correctly, we have two possibilities.

1) 
Use year as a super-stratum (assumption: the samples are independent between years; for DHS this is often violated).
Our code now looks as follows (v024 = region; v025 = rural/urban; v001 = cluster):

gen weight = v005 / 1000000
egen stratid = group(year v024 v025), label
egen wave_psu = group(year v001), label
svyset wave_psu [pweight=weight], strata(stratid)

2) 
Define the regions in each dataset the same way and then use the following commands:
gen weight = v005 / 1000000
egen stratid = group(v024 v025), label
egen wave_psu = group(v001), label
svyset wave_psu [pweight=weight], strata(stratid)

In most of the countries, we would have to aggregate the regions in the newer datasets to match the regions defined in the older datasets.

How can we decide which option is more suitable? (Time and effort should not be a deciding factor.)
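
The aggregation step described above could be sketched as follows. This is only an illustration on made-up data: the mapping of region codes is hypothetical and would have to be read off the DHS documentation or maps for each country, and `region_h` is a name introduced here, not a DHS variable.

```stata
* Toy data (NOT DHS): the older wave has 2 regions, the newer wave has 4.
clear
input year v024
2005 1
2005 2
2010 1
2010 2
2010 3
2010 4
end

* HYPOTHETICAL mapping: new regions 1-2 were carved out of old region 1,
* new regions 3-4 out of old region 2.
gen region_h = v024 if year == 2005
replace region_h = 1 if year == 2010 & inlist(v024, 1, 2)
replace region_h = 2 if year == 2010 & inlist(v024, 3, 4)

* Every observation should now have a harmonized region code.
assert !missing(region_h)
tabulate region_h year
```

The harmonized variable could then take the place of v024 in the -egen group()- call or serve as the regional control in the regression models.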

We would be grateful for any help!

Kind regards,

Lukas & Jörg


________________________________________
From: [email protected] <[email protected]> on behalf of Stas Kolenikov <[email protected]>
Sent: Friday, 3 January 2014 12:41
To: [email protected]
Subject: Re: st: Pooled Datasets (DHS) - use of syvset & regional controls

If the samples are independent between years, you can treat year as a
super-stratum and add year to the -egen group()- statement. The
independence assumption does not quite hold for DHS (their designs
sometimes reuse the same clusters), but if you don't want to spend time
looking at the maps and figuring out how the regions were defined in
each wave, this is a reasonable approach. It is conservative for the
estimates of change, in the sense that the standard errors won't account
for the (likely positive) correlation over time within the cluster.
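
A minimal sketch of the super-stratum setup on simulated data may make this concrete. Nothing here is DHS data: the variable names v001, v005, v024 and v025 follow the DHS conventions used in this thread, while `pos`, `educ` and `y` and all the numbers are made up for illustration. The year interaction at the end is one common way to compare a coefficient across waves in a pooled model; it is shown as an assumption about what such an analysis might look like, not as a recommendation from this thread.

```stata
* Simulated data only (NOT DHS): 2 waves x 12 clusters x 10 respondents.
clear
set seed 12345
set obs 240
gen year = cond(_n <= 120, 2005, 2010)
gen pos  = mod(_n - 1, 120)                  // position within wave, 0..119
gen v001 = floor(pos / 10) + 1               // cluster numbers 1..12, reused in both waves
gen v024 = cond(v001 <= 6, 1, 2)             // region
gen v025 = cond(mod(v001 - 1, 6) < 3, 1, 2)  // 1 = urban, 2 = rural
gen v005 = 1000000                           // equal weights, for simplicity
gen educ = floor(13 * runiform())            // hypothetical covariate
gen y    = 2 + 0.3 * educ + 0.5 * (year == 2010) + rnormal()  // hypothetical outcome

* Year enters both group() calls, so strata and clusters with the same
* codes in different waves are treated as distinct.
gen weight = v005 / 1000000
egen stratid  = group(year v024 v025)
egen wave_psu = group(year v001)
svyset wave_psu [pweight = weight], strata(stratid)

* Pooled model; the interaction lets the educ slope differ by wave.
svy: regress y c.educ##i.year
test 2010.year#c.educ
```

Because the same cluster numbers appear in both waves of the toy data, leaving year out of the -egen group()- calls would merge clusters 1 to 12 across waves, which is appropriate only if they really are the same clusters.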

Figuring out how to deal with these weird changes in design between
waves in DHS prompted me to work on
http://web.missouri.edu/~kolenikovs/papers/clusters-repeated-3.pdf
which eventually came out as
http://www5.statcan.gc.ca/bsolc/olc-cel/olc-cel?catno=12-001-X201100111449&lang=eng.

-- Stas Kolenikov, PhD, PStat (ASA, SSC)
-- Senior Survey Statistician, Abt SRBI
-- Opinions stated in this email are mine only, and do not reflect the
position of my employer
-- http://stas.kolenikov.name



On Fri, Jan 3, 2014 at 5:04 PM, Jörg Kattner <[email protected]> wrote:
> Dear Statalist members,
>
> For a research paper we would like to pool household surveys (DHS) from different years into a single dataset.
>
> We experience the following two challenges:
>
> 1)
> To account for the complex survey design, we think we have to correctly specify the weights, strata, and clusters for each survey. Even though the surveys are from the same country, their designs can differ slightly from year to year.
> Thus, when we pool them, we still want to specify the survey design correctly, and the question is how to do that. Previously, when analyzing each year on its own, we used code along the following lines:
>
> gen weight = v005 / 1000000
> egen stratid = group(v024 v025), label
> svyset [pweight=weight], psu(v021) strata(stratid)
>
> The main thing that differs between the surveys is the stratification variables. Sometimes a stratification variable already exists; sometimes we had to create one as above. Also, the variable v024 (region) sometimes has, for example, 6 values in one year and 10 in the next. Is it even possible to stratify our dataset correctly when we pool different surveys?
>
>
> 2)
> Since we also want to control for regional/community effects later on in our regression models (using svy: reg or svy: logit/clogit), it can be problematic if the defined regions and clusters differ between the surveys.
>
> The only solution we see is to run separate regressions for each year/survey. The drawback is that one cannot directly see whether differences between years/surveys in the constant term or in the coefficient of maternal education are significant.
>
> Is there any other statistical method that could deal with this dilemma?
>
>
> Any help is much appreciated. Thanks a lot in advance!
>
>
> Best regards,
> Lukas & Jörg
>
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/


