Re: st: RE: Cluster Robust Standard Errors for Cross Country Data
Abekah Nkrumah <firstname.lastname@example.org>
Thu, 5 Jul 2012 16:05:38 +0100
Thanks very much for the material
On Thu, Jul 5, 2012 at 3:52 PM, Steve Samuels <email@example.com> wrote:
> I forgot an interesting thread for comparing weighted and unweighted means
> that was started at: http://www.stata.com/statalist/archive/2011-06/msg00405.html
> Austin Nichols suggested the DuMouchel-Duncan and Winship-Radbill references. Stas
> Kolenikov mentioned the important paper by Pfeffermann (1993), which can be found at: http://www.stat.cmu.edu/~brian/905-2008/papers/Pfeffermann-ISR-1993.pdf
> Reference: Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review/Revue Internationale de Statistique, 317-337.
>> In response to my request to see the codebook that advised against using
>> weights in the Demographic and Health Surveys, Abekah Nkrumah privately sent me
>> a document:
>> GUIDE TO DHS STATISTICS
>> Shea Oscar Rutstein, Ph.D. Guillermo Rojas, M.C.S., M.A.
>> Demographic and Health Surveys ORC Macro Calverton, Maryland
>> September 2006
>> which states on page 14 (in reverse order):
>> "5. Use of sample weights biases estimates of confidence intervals in most
>> statistical packages since the number of weighted cases is taken to produce the
>> confidence interval instead of the true number of observations. For oversampled
>> areas or groups, use of the sample weights will drastically overestimate
>> sampling variances and confidence intervals for those groups."
>> My response: This paragraph refers to the fact that some statistical packages
>> are not survey-aware and so treat all weights as frequency weights. It is not
>> an argument against probability weighting.
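A minimal sketch of the distinction, assuming Stata's shipped auto dataset and using rep78 purely as a stand-in integer weight (it is not a real sampling weight). Both commands should give the same coefficients; what differs is how the number of observations and the standard errors are computed.

```stata
sysuse auto, clear
* rep78 is only an illustrative integer weight (assumption, not a survey weight)
drop if missing(rep78)

* Frequency weights: each row counts as rep78 identical observations,
* so the reported N is the sum of the weights
regress price mpg [fweight = rep78]

* Probability weights: N is the number of rows actually observed,
* and Stata reports robust standard errors
regress price mpg [pweight = rep78]
```

Comparing the two outputs shows the problem the DHS guide describes: it arises from treating sampling weights as frequency weights, not from weighting as such.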
>> "4. Use of sample weights is inappropriate for estimating relationships, such as
>> regression and correlation coefficients."
>> My response:
>> I'm not surprised that the authors gave no justification for their assertion.
>> It's not true in general (see any advanced text) and I see no reason why it
>> would apply to the DHS without qualification. There are certainly some
>> situations where an unweighted analysis is preferable. Abekah should review at a
>> minimum the downloadable abstract of Winship and Radbill (1994) and the
>> downloadable reference by DuMouchel and Duncan (1983). Groves (1989) presents
>> an interesting example and argument. (I am traveling and so do not have a
>> page reference.) To sum up: Unless Abekah can provide substantive justification
>> for doing otherwise, he should use the weights.
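One way to check whether weighting matters for a given regression is a test in the spirit of DuMouchel and Duncan (1983): fit the unweighted model augmented with the weight and its interactions with the covariates, then test those added terms jointly. The sketch below is an illustration under assumptions, not a procedure taken from the thread; y, x1, x2, and sampling_weight are hypothetical variable names.

```stata
* y, x1, x2, sampling_weight are hypothetical variable names
generate w_x1 = sampling_weight * x1
generate w_x2 = sampling_weight * x2

* Unweighted regression augmented with the weight and weight-by-covariate terms
regress y x1 x2 sampling_weight w_x1 w_x2

* Joint F test: a small p-value suggests the weighted and unweighted
* coefficient estimates differ, so the weights should not be dropped
test sampling_weight w_x1 w_x2
```

Note that this version uses ordinary standard errors; with clustered survey data a cluster-robust variant of the test would be the more cautious choice.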
>> W DuMouchel & G Duncan (1983) “Using Sample Survey Weights in Multiple
>> Regression Analysis.” Journal of the American Statistical Association
>> 78(383):535-543. Download at:
>> Groves, R. M. (1989). Survey errors and survey costs, New York: Wiley.
>> Winship, C., & Radbill, L. (1994). Sampling Weights and Regression Analysis.
>> Sociological Methods & Research, 23(2), 230-257. Abstract at: smr.sagepub.com/content/23/2/230.refs
> You are welcome, Gordon. Could you please post a link to the study and to the codebook that advises that weights are not necessary?
> On Jul 3, 2012, at 5:34 AM, Abekah Nkrumah wrote:
> Dear Steve,
> Thank you for the response. In response to your question: YES, the data
> has within-country sample weights and strata. The strata variable is the
> cluster_var. Each country is divided into clusters and from within
> each cluster households are sampled for interviews. So the strata
> variable is the same as the cluster variable. That being the case,
> what will then constitute cluster_var in the survey command that you
> suggested?
> Secondly I have already done some estimations at the country level
> without using the survey command but correcting for possible
> intra-cluster correlations using the cluster variable. So for
> consistency I would want to continue the cross country without survey
> commands. I did not use the survey commands for simplicity and
> secondly the data code book advises that it is not necessary to
> include sample weights in estimations. The issue then is just
> correcting the intra-cluster correlations arising from the within
> country cluster correlations at a cross country level.
> Nonetheless, I will appreciate your answer to the first question as
> well so I can try the two and see what differences there might be.
> On Mon, Jul 2, 2012 at 10:56 PM, Steve Samuels <firstname.lastname@example.org> wrote:
>> It's quite all right to combine surveys.
>> Some questions for you:
>> Are sampling weights provided? I'll assume
>> so below. If not, what do you know about the sample weighting?
>> Are sampling strata within countries identified?
>> I suggest that you -svyset- the data
>> svyset cluster_var [pw = sampling_weight ] , strata(country)
>> If there were within-country strata, then define
>> egen super_strat = group(country stratum_var)
>> and substitute "strata(super_strat)" in the -svyset- statement.
>> Then use commands that take a -svy- prefix. To see Stata's official survey-aware
>> commands type "help svy_estimation"
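The steps above can be sketched end to end as follows. The names y, x1, and x2 are hypothetical, as is the psu variable: it is added here on the assumption that cluster ids may repeat across countries, and it is harmless if they are already unique.

```stata
* country, stratum_var, cluster_var, sampling_weight are the names used in
* the thread; y, x1, x2, and psu are hypothetical
egen super_strat = group(country stratum_var)

* Make the cluster identifier unique across countries (assumption: ids may repeat)
egen psu = group(country cluster_var)

* Declare the design: clusters as PSUs, probability weights, country-by-stratum strata
svyset psu [pw = sampling_weight], strata(super_strat)

* Check the declared design, then fit a survey-aware regression
svydescribe
svy: regress y x1 x2 i.country
```

If there are no within-country strata, strata(country) can be used in place of strata(super_strat), as in the original suggestion.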
>> On Jul 2, 2012, at 5:35 PM, Abekah Nkrumah wrote:
>> Dear Mark,
>> Thank you very much for the response. Reading your response I was
>> wondering what the difference will be if I decide to cluster on the
>> cluster id instead of the household id. As I indicated in my earlier
>> mail, there is actually a cluster variable for each country. This
>> cluster variable contains the different clusters for each country from
>> which households were sampled. In my dataset the country with the
>> lowest number of clusters has about 412.
>> Thank you very much
>> On Mon, Jul 2, 2012 at 4:08 PM, Schaffer, Mark E <M.E.Schaffer@hw.ac.uk> wrote:
>>>> -----Original Message-----
>>>> From: email@example.com
>>>> [mailto:firstname.lastname@example.org] On Behalf Of
>>>> Abekah Nkrumah
>>>> Sent: 02 July 2012 10:32
>>>> To: email@example.com
>>>> Subject: st: Cluster Robust Standard Errors for Cross Country Data
>>>> Dear Stata List,
>>>> I have pooled cross-section household datasets from 20
>>>> countries. For each of these countries, the data was
>>>> collected via cluster sampling meaning there will be
>>>> intra-cluster correlations which will affect the validity of
>>>> the standard errors. If I were carrying out my estimations on
>>>> a single country I know that I could correct for the possible
>>>> bias in the standard errors by using the variable containing
>>>> the cluster ids to estimate cluster robust standard errors.
>>>> In the present case where I have pooled (i.e., appended, as in
>>>> Stata) the household cross-section data from 20 different
>>>> countries, will it be right to still use the variable
>>>> containing the cluster ids to estimate the cluster robust
>>>> standard errors? Note that now the cluster ids will be for
>>>> all 20 countries.
>>> This is problematic. The consistency of the cluster-robust covariance
>>> estimator is asymptotic in the number of clusters, and 20 isn't very far
>>> on the way to infinity. Clustering on country is probably not a great
>>> idea.
>>> An alternative is to cluster on household ID and to use country dummies
>>> when you pool the data. This would allow for arbitrary within-household
>>> correlation (via clustering on household ID) and invariant
>>> within-country correlation (via the country dummies).
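A minimal sketch of this approach without the -svy- prefix, next to the variant that clusters on the sampling cluster instead of the household. All variable names (y, x1, x2, hh_id, cluster_var, country) are hypothetical, and the -egen- steps assume that household and cluster ids may repeat across the appended country files.

```stata
* y, x1, x2, hh_id, cluster_var, country are hypothetical variable names
* Build identifiers that are unique across the 20 pooled countries
egen hh_uid  = group(country hh_id)
egen psu_uid = group(country cluster_var)

* Household-level clustering plus country dummies
regress y x1 x2 i.country, vce(cluster hh_uid)

* Variant: cluster on the sampling cluster within each country
regress y x1 x2 i.country, vce(cluster psu_uid)
```

Both versions avoid clustering on country itself, so the number of clusters is far larger than 20.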
>>>> I will appreciate your help.
>>>> Thank you very much