

From: Julian Doczi <juliandoczi@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: st: DHS svy questions on weights and merged datasets
Date: Tue, 19 Jun 2012 23:17:11 +0800

Dear Statalist Members,

I am undertaking an impact evaluation using Demographic and Health Survey (DHS) data from the Philippines and difference-in-difference regression, and I have a couple of conceptual questions about handling survey data with Stata's -svy- commands that I hope someone can assist me with. I have already searched Statalist and the internet but have not found satisfying answers, though I apologise if these questions have indeed been asked before. I am using Stata 11.2 on Windows.

Firstly, when using -svy-, is it possible to make use of the sample weights from both the "Individual Recode" and the "Household Recode" of the DHS simultaneously? For my analysis I am mainly using the individual recode, but I also had to merge in a variable from the household recode on the household's source of non-drinking water. When I use the -svyset- command, I specify only the sample weight of the individual recode (for those who know DHS, the variable would be "v005 / 1,000,000") as a 'pweight'. Am I introducing some sort of error if I then run -svy: regress __ - using the merged household variable, whose source dataset has its own _different_ sample weight (again, for those who know DHS, it is the "hv005 / 1,000,000" variable, calculated with a different formula)? Is this something that should concern me, and is there any way around it? Should I somehow apply the weight to the household variable before I merge it into the individual recode dataset?

Secondly, since I wish to do an impact evaluation using difference-in-difference regression, I will be using DHS data from both 1998 and 2008. Normally, to do a dif-in-dif, one merges both datasets together and creates a dummy variable for whether the time period is 1998 or 2008.
Conceptually, though, do the -svy- commands function properly if the single dataset is actually composed of two different datasets? In the -svyset- command, both datasets (combined into one long dataset) would have PSU, sample weight, and strata variables of the same type, but the actual meaning of these variables would be very different in each. For example, the 1998 DHS for the Philippines has 752 unique PSUs, numbered from 1 to 755, while the 2008 DHS has 792 unique PSUs, numbered from 1 to 794, which may or may not be the same PSUs as those from 1998 (and even if there is overlap, there is a very small probability that the same PSU would carry the same number in both years). Likewise, the sample weights and strata are tailored specifically to each dataset.

So, if I merge these two datasets, the merged PSU variable would range from 1 to 794 and essentially have two replicates for 752 of these values (i.e. 1, 1, 2, 2, 3, 3, etc.). Can Stata's -svy- command handle this, or do I need to use a different command, or a different way of merging / preparing my data for dif-in-dif regression? I imagine that -svy- will just treat the data as one big dataset, but this is not correct in terms of accurate calculation of standard errors/variances, is it?

Surely I am not the first person either to merge data between DHS recodes (question one) or to attempt estimation using data from two DHS rounds (question two), so I am hoping that someone with previous experience will be able to assist me. As I mentioned, I searched as well as I could through Statalist but did not come across answers to these questions. I also apologise in advance if the questions have fairly obvious answers; my formal econometric/statistical training to date has been very limited!
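The stacked two-survey set-up described in the second question might be sketched in Stata roughly as follows. This is only an illustration under stated assumptions: the file names and the variables `outcome` and `treated` are hypothetical placeholders, `v022` is used as the stratum purely for illustration (the choice of stratum variable is itself an open question in this post), and the year-specific identifiers in the last step are one possible way of keeping the two samples apart, not a confirmed answer.

```stata
* Sketch only: file names, outcome, and treated are hypothetical placeholders
use "ph1998_individual.dta", clear
generate int year = 1998
append using "ph2008_individual.dta"   // may need care if variable types differ
replace year = 2008 if missing(year)

* Rescale the individual-recode weight as described in the post
generate double wt = v005 / 1000000

* Naive declaration: PSU and stratum codes are reused across the two years,
* so PSU 1 in 1998 and PSU 1 in 2008 are treated as the same cluster
svyset v021 [pweight = wt], strata(v022)

* Count the distinct PSU codes within each survey year
egen byte psu_tag = tag(year v021)
tabulate year if psu_tag

* Assumption: build year-specific PSU and stratum identifiers so that
* the 1998 and 2008 samples are not pooled into the same clusters
egen long psu_year = group(year v021)
egen long stratum_year = group(year v022)
svyset psu_year [pweight = wt], strata(stratum_year)

* Difference-in-difference regression on the stacked data
svy: regress outcome i.treated##i.year
```

Comparing the standard errors from the two -svyset- declarations on the same model would show how much the reused PSU codes matter in practice.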
Finally, just to confirm, as I have read conflicting accounts on this: for the -svyset- command using DHS, is the strata variable I should use "v022" ("Sample stratum number")? I have read that whether one uses this, "v023" (Sample Domain), "v024" (Region), or "v025" (Type of place of Residence - Urban/Rural) depends on how the particular country's survey was sampled (e.g. http://www.stata.com/statalist/archive/2009-07/msg00906.html ; http://www.stata.com/statalist/archive/2011-07/msg00614.html).

Based on that, here is the basic description of sampling for the 1998 and 2008 Philippine DHS, which I have summarised from the country's DHS final reports (available from MeasureDHS):

"The DHS is a multi-stage stratified design, designed to represent all 17 regions of the country. In each region, a stratified 3-stage sample design was employed. First, PSUs were selected with probability proportional to the estimated number of HHs from the 2000 Census. PSUs consisted of one barangay (village) or a group of contiguous barangays (villages). Second, enumeration areas (EAs) were selected within sampled PSUs with probability proportional to size. Third, housing units were selected with equal probability within EAs. EAs = area within barangay (village) consisting of ~150 contiguous HHs - these were identified during the 2000 Census."

In these datasets, "v023" simply equals zero (i.e. a national focus - see http://www.measuredhs.com/faq.cfm, question 4 under "Using Data Files"), and since urban vs. rural is not mentioned anywhere, I assume that my strata must be either v022 (which has about 356 unique values for 2008; as mentioned above, v021 contains 792 unique PSUs for 2008) or v024 (with 17 unique values for 2008). Based on my description above, can you help me decide which? Although the first link I included above discourages use of v022, other links I have seen made use of it.
Although the aforementioned MeasureDHS FAQ recommends using v023, it does not specifically state what to do if v023 equals zero, as it does for me; it only says to 'investigate your specific survey'.

I would greatly appreciate any assistance that could be offered, and/or further reading/resources that could assist me on these issues.

Thank you very much in advance and Best Regards,

--
Julian Doczi (Mr.)
University of East Anglia, Norwich, U.K.
juliandoczi@gmail.com

**Follow-Ups**: **Re: st: DHS svy questions on weights and merged datasets** *From:* Steve Samuels <sjsamuels@gmail.com>
