st: DHS svy questions on weights and merged datasets

From   Julian Doczi
Subject   st: DHS svy questions on weights and merged datasets
Date   Tue, 19 Jun 2012 23:17:11 +0800

Dear Statalist Members,

I am undertaking an impact evaluation using Demographic and Health
Survey (DHS) data from the Philippines and difference-in-difference
regression, and have a couple of conceptual questions regarding the
handling of survey data with Stata's -svy- command that I hope someone
can assist me with. I have searched through Statalist and the internet
already, but have not been able to find satisfying answers to these
questions, though my apologies are in order if they have indeed
already been asked. I am using Stata 11.2 in Windows.

Firstly, when using -svy-, is it possible to simultaneously make use
of the sample weights from both the "Individual Recode" of the DHS and
the "Household Recode" of the DHS? - For my analysis, I am mainly
using the individual recode, but I also had to merge in a variable
from the household recode on the household's source of non-drinking
water. When I use the -svyset- command, I specify only the sample
weight of the individual recode (for those who know DHS, the variable
would be "v005 / 1,000,000") as a 'pweight'. But am I creating some
sort of error if I then go ahead and do a -svy: regress __ - using the
merged household variable, for which its dataset has its own
_different_ sample weight (again, for those who know DHS, it is the
"hv005 / 1,000,000" variable and is calculated with a different
formula)? Is this something that should concern me and is there any
way around it? Should I apply the weight somehow to the household
variable before I merge it over into the individual recode dataset?
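
For concreteness, the setup described in this first question can be sketched in Stata roughly as follows. This is only a sketch on toy data, not a recommendation: the key variables (v001/v002 and hv001/hv002 for cluster and household number), the use of hv202 for the water-source variable, its codes, and the outcome y are all assumptions made for illustration. It shows the household variable being attached to each woman's record while -svyset- still uses only the individual weight, which is the situation being asked about.

```stata
* Toy stand-in for the household recode (all values are made up)
clear
input hv001 hv002 hv005 hv202
1 1 1200000 11
1 2 1200000 31
2 1  800000 11
2 2  800000 31
3 1 1500000 31
3 2 1500000 11
4 1  900000 11
4 2  900000 31
end
* Give the household keys the same names as in the individual recode
rename hv001 v001
rename hv002 v002
tempfile hh
save `hh'

* Toy stand-in for the individual recode
clear
input v001 v002 v003 v005 v021 v022 y
1 1 2 1100000 1 1 3
1 2 2 1100000 1 1 5
1 2 3 1100000 1 1 4
2 1 2  900000 2 1 6
2 2 2  900000 2 1 2
3 1 2 1400000 3 2 7
3 2 2 1400000 3 2 4
4 1 2 1000000 4 2 5
4 2 2 1000000 4 2 8
end

* Several women can share one household record, hence m:1
merge m:1 v001 v002 using `hh', keepusing(hv202 hv005)
tab _merge
drop if _merge == 2          // households with no interviewed woman

* The unit of analysis is the woman, so only v005 enters -svyset-;
* hv005 is carried along here purely for comparison
gen double wt = v005/1000000
gen byte other_src = (hv202 != 11)   // hypothetical recoding of hv202
svyset v021 [pweight = wt], strata(v022)
svy: regress y other_src
```

In this sketch the household weight never enters the estimation; whether that is appropriate is exactly the question posed above.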

Secondly, since I wish to do an impact evaluation using
difference-in-difference regression, this means that I will be using
DHS data from both 1998 and 2008. Normally, to do a dif-in-dif, one
appends the two datasets and creates a dummy variable indicating
whether each observation's time period is 1998 or 2008. Conceptually, though, do
the -svy- commands function properly if the single dataset is actually
composed of two different datasets? For example, in the -svyset-
command, although both datasets (composed as one, long dataset) would
have PSU, sample weight, and strata variables of the same type, the
actual meaning of these variables for each separate dataset would be
very different. For example, the 1998 DHS for the Philippines has 752
unique PSUs, ranging from 1 to 755, while its 2008 DHS has 792 unique
PSUs, ranging from 1 to 794, that may or may not be the same PSUs as
those from 1998 (and even if there is overlap, there is a very small
probability that the PSUs would share the same numerical values).
Likewise, the sample weights and strata are similarly tailored
specifically to the particular dataset. So, if I combine these two
datasets, I will have a situation where, for example, the combined PSU
variable would range from 1 to 794 and essentially have two replicates
for 752 of these values (i.e. 1, 1, 2, 2, 3, 3, etc.).
So, can Stata's -svy- command handle this, or do I need to use a
different command, or a different way of merging / preparing my data
for dif-in-dif regression? I imagine that -svy- will just treat the
data as one big dataset, but this is not correct in terms of accurate
calculations of standard errors/variances, is it?
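
One way the pooled setup from this second question might be prepared is sketched below, again on toy data. The variables y and treat and all values are hypothetical, and the approach shown (giving each survey's PSUs and strata their own identifiers with -egen, group()- before -svyset-) is one possible way to stop PSU 1 of 1998 being treated as the same cluster as PSU 1 of 2008. It is offered as an assumption to be checked, not as the established method, and it leaves open how the two surveys' weights should be scaled relative to each other.

```stata
* Toy stand-in for the 1998 individual recode (values are made up)
clear
input v021 v022 v005 y treat
1 1 1000000 3 0
1 1 1000000 4 0
2 1 1000000 5 1
2 1 1000000 6 1
3 2 1500000 2 0
3 2 1500000 3 0
4 2 1500000 7 1
end
gen int year = 1998
tempfile dhs1998
save `dhs1998'

* Toy stand-in for the 2008 individual recode, with overlapping PSU numbers
clear
input v021 v022 v005 y treat
1 1  900000 4 0
1 1  900000 5 0
2 1  900000 8 1
3 2 1200000 3 0
4 2 1200000 9 1
4 2 1200000 7 1
end
gen int year = 2008

* Stack the two surveys into one long dataset
append using `dhs1998'
gen byte post = (year == 2008)

* Make PSU and stratum identifiers unique across the two surveys
egen long psu_id     = group(year v021)
egen long stratum_id = group(year v022)

gen double wt = v005/1000000
svyset psu_id [pweight = wt], strata(stratum_id)

* Difference-in-difference: the treat-by-post interaction is the term of interest
svy: regress y i.treat##i.post
```

With this construction the eight toy PSUs fall into four strata (two per survey year), so the two surveys are treated as separate samples in the variance calculation.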

Surely I am not the first person to either merge data between DHS
recodes (question one) or to attempt estimations using data from two
DHS (question two), so I am hoping that someone with previous
experience will be able to assist me with this. As I mentioned, I
searched as well as I could through Statalist, but did not come across
answers to these. I also apologise in advance if the questions I am
asking have fairly obvious answers; my formal econometric/statistical
training to date has been very limited!

Finally, just to confirm, as I have read conflicting accounts on this,
for the -svyset- command using DHS, the strata variable I should use
is "v022" - "Sample stratum number"? I have read that whether one uses
this or uses "v023" (Sample Domain) or "v024" (Region) or "v025" (Type
of place of Residence - Urban/Rural) depends specifically on how the
country's particular survey was sampled.
(e.g. ;
Based on that, here is the basic description of sampling for the 1998
and 2008 Philippine DHS, which I have summarised from the country's
DHS final reports (available from MeasureDHS):
"The DHS is a multi-stage stratified design, designed to represent all
17 regions of the country. In each region, a stratified 3-stage sample
design was employed. First, PSUs were selected with probability
proportional to the estimated number of HHs from the 2000 Census. PSUs
consisted of one barangay (village) or a group of contiguous barangays
(villages). Second, enumeration areas (EAs) were selected within
sampled PSUs with probability proportional to size. Third, housing
units were selected with equal probability within EAs. EAs = area
within barangay (village) consisting of ~150 contiguous HHs - these
were identified during the 2000 Census."
In these datasets, "v023" simply equals zero (i.e. a national focus;
see question 4 under "Using Data Files" in the MeasureDHS FAQ), and
since a consideration of urban vs. rural is not mentioned
anywhere, I assume that my strata must either be v022 (which includes
about 356 unique values for 2008 {as mentioned above, v021 contains
792 unique PSUs for 2008}) or v024 (with 17 unique values for 2008).
Based on my above description, can you help me decide which? Although
the first link I included above discourages use of v022, other links I
have seen made use of it. Although the aforementioned MeasureDHS FAQ
recommends using v023, it does not specifically state what to do if
v023 equals zero, as it does for me - it only says to 'investigate
your specific survey'.
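
Some simple checks that might help when comparing candidate strata variables are sketched below on toy data. The values are invented and the checks are only diagnostics under the assumption that each PSU should fall inside exactly one stratum; they do not by themselves say which variable matches the survey's actual design.

```stata
* Toy stand-in with a PSU variable and two candidate strata variables
clear
input v021 v022 v024 v005
1 1 1 1000000
1 1 1 1000000
2 1 1 1000000
3 2 1 1200000
4 2 1 1200000
5 3 2  900000
6 3 2  900000
end
gen double wt = v005/1000000

* Flag any PSU whose observations span more than one value of v022
bysort v021 (v022): gen byte split = (v022[1] != v022[_N])
tab split

* List each stratum with its number of PSUs and observations
svyset v021 [pweight = wt], strata(v022)
svydescribe
```

The same two steps can be repeated with v024 in place of v022 to compare how the PSUs are distributed under each candidate.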

I would greatly appreciate any assistance that could be offered,
and/or further reading/resources that could assist me on these issues.

Thank you very much in advance and Best Regards,

Julian Doczi (Mr.)
University of East Anglia, Norwich, U.K.