Re: st: RE: Cluster Robust Standard Errors for Cross Country Data


From   Abekah Nkrumah <ankrumah@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: RE: Cluster Robust Standard Errors for Cross Country Data
Date   Thu, 5 Jul 2012 16:05:38 +0100

Dear Steve,

Thanks very much for the material.

Regards

Gordon

On Thu, Jul 5, 2012 at 3:52 PM, Steve Samuels <sjsamuels@gmail.com> wrote:
> I forgot an interesting thread for comparing weighted and unweighted means
> that was started at: http://www.stata.com/statalist/archive/2011-06/msg00405.html
> Austin Nichols suggested the DuMouchel-Duncan and Winship-Radbill references. Stas
> Kolenikov mentioned the important paper by Pfeffermann (1993), which can be found at: http://www.stat.cmu.edu/~brian/905-2008/papers/Pfeffermann-ISR-1993.pdf
>
> Reference: Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review/Revue Internationale de Statistique, 317-337.
>
>
> Steve
> sjsamuels@gmail.com
>
>
>
>
>
>> In response to my request to see the codebook that advised against using
>> weights in the Demographic and Health Surveys, Abekah Nkrumah privately sent me
>> a document:
>>
>> GUIDE TO DHS STATISTICS
>> Shea Oscar Rutstein, Ph.D. Guillermo Rojas, M.C.S., M.A.
>> Demographic and Health Surveys ORC Macro Calverton, Maryland
>> September 2006
>>
>> which states on page 14 (in reverse order):
>>
>> "5. Use of sample weights biases estimates of confidence intervals in most
>> statistical packages since the number of weighted cases is taken to produce the
>> confidence interval instead of the true number of observations. For oversampled
>> areas or groups, use of the sample weights will drastically overestimate
>> sampling variances and confidence intervals for those groups."
>>
>> My response: This paragraph refers to the fact that some statistical packages
>> are not survey-aware and so treat all weights as frequency weights. It is not
>> an argument against probability weighting.
>>
>> "4. Use of sample weights is inappropriate for estimating relationships, such as
>> regression and correlation coefficients."
>>
>> My response:
>>
>> I'm not surprised that the authors gave no justification for their assertion.
>> It's not true in general (see any advanced text) and I see no reason why it
>> would apply to the DHS without qualification. There are certainly some
>> situations where an unweighted analysis is preferable. Abekah should review at a
>> minimum the downloadable abstract of Winship and Radbill (1994) and the
>> downloadable reference by DuMouchel and Duncan (1983). Groves (1989) presents
>> an interesting example and argument. (I am traveling and so do not have a
>> page reference.) To sum up: Unless Abekah can provide substantive justification
>> for doing otherwise, he should use the weights.
>>
>>
>> References:
>>
>> DuMouchel, W., & Duncan, G. (1983). Using Sample Survey Weights in Multiple
>> Regression Analysis. Journal of the American Statistical Association,
>> 78(383), 535-543. Download at:
>> www.stat.cmu.edu/~brian/905-2008/papers/DumouchelDuncan-JASA-1983.pdf
>>
>> Groves, R. M. (1989). Survey errors and survey costs, New York: Wiley.
>>
>> Winship, C., & Radbill, L. (1994). Sampling Weights and Regression Analysis.
>> Sociological Methods & Research, 23(2), 230-257. Abstract at: smr.sagepub.com/content/23/2/230.refs
>>
>> Steve
>> sjsamuels@gmail.com
>>
>>
>
>
> You are welcome, Gordon. Could you please post a link to the study and to the codebook that advises that weights are not necessary?
>
> Thanks,
>
> Steve
>
> On Jul 3, 2012, at 5:34 AM, Abekah Nkrumah wrote:
>
> Dear Steve,
>
> Thank you for the response. To answer your question: yes, the data
> have within-country sample weights and strata. The stratum variable is
> the cluster_var: each country is divided into clusters, and households
> are sampled for interview from within each cluster. So the strata
> variable is the same as the cluster variable. That being the case,
> what would then constitute cluster_var in the survey command that you
> gave?
>
> Secondly, I have already done some estimations at the country level
> without using the survey commands, correcting for possible
> intra-cluster correlation by using the cluster variable. For
> consistency, I would like to continue the cross-country analysis
> without survey commands. I did not use the survey commands, first for
> simplicity and second because the data codebook advises that it is not
> necessary to include sample weights in estimations. The issue, then, is
> just correcting, at the cross-country level, for the intra-cluster
> correlation arising within each country.
>
> Nonetheless, I will appreciate your answer to the first question as
> well so I can try the two and see what differences there might be.
>
> Regards
>
> Gordon
>
> On Mon, Jul 2, 2012 at 10:56 PM, Steve Samuels <sjsamuels@gmail.com> wrote:
>>
>> It's quite all right to combine surveys.
>>
>> Some questions for you:
>>
>> Are sampling weights provided? I'll assume
>> so below. If not, what do you know about the sample weighting?
>> Are sampling strata within countries identified?
>>
>> I suggest that you -svyset- the data
>>
>> ***************************
>> svyset cluster_var  [pw = sampling_weight ] , strata(country)
>> **************************
>>
>> If there were within-country strata, then define
>> ***********************************************************
>> egen super_strat = group(country stratum_var)
>> ******************************************************
>> and substitute "strata(super_strat)" in the -svyset- statement.
>>
>> Then use commands that take a -svy- prefix. To see Stata's official survey-aware
>> commands, type "help svy_estimation".
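>>
>> A minimal sketch of the whole sequence, assuming within-country strata
>> exist and using hypothetical variable names (country, stratum_var,
>> cluster_var, sampling_weight, y, x1, x2) that would need to be replaced
>> with the names in the pooled dataset:
>>
>> ***********************************************************
>> * build strata that are unique across the pooled countries
>> egen super_strat = group(country stratum_var)
>> * declare the design: clusters as PSUs, probability weights, pooled strata
>> svyset cluster_var [pw = sampling_weight], strata(super_strat)
>> * inspect the declared design before estimating
>> svydescribe
>> * survey-aware regression with country dummies (y, x1, x2 are placeholders)
>> svy: regress y x1 x2 i.country
>> ***********************************************************
>>
>> Running -svydescribe- first is only a suggestion: it lists the number of
>> PSUs per stratum, which makes it easier to spot strata with a single PSU
>> before looking at standard errors.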
>>
>> Steve
>>
>> On Jul 2, 2012, at 5:35 PM, Abekah Nkrumah wrote:
>>
>> Dear Mark,
>>
>> Thank you very much for the response. Reading your response I was
>> wondering what the difference will be if I decide to cluster on the
>> cluster id instead of the household id. As I indicated in my earlier
>> mail, there is actually a cluster variable for each country. This
>> cluster variable contains the different clusters for each country from
>> which households were sampled. In my dataset, the country with the
>> fewest clusters has about 412 of them.
>>
>> Thank you very much
>>
>> On Mon, Jul 2, 2012 at 4:08 PM, Schaffer, Mark E <M.E.Schaffer@hw.ac.uk> wrote:
>>> Gordon,
>>>
>>>> -----Original Message-----
>>>> From: owner-statalist@hsphsun2.harvard.edu
>>>> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of
>>>> Abekah Nkrumah
>>>> Sent: 02 July 2012 10:32
>>>> To: statalist@hsphsun2.harvard.edu
>>>> Subject: st: Cluster Robust Standard Errors for Cross Country Data
>>>>
>>>> Dear Stata List,
>>>>
>>>> I have pooled cross-section household datasets from 20
>>>> countries. For each of these countries, the data were
>>>> collected via cluster sampling, meaning there will be
>>>> intra-cluster correlations, which will affect the validity of
>>>> the standard errors. If I were carrying out my estimations on
>>>> a single country I know that I could correct for the possible
>>>> bias in the standard errors by using the variable containing
>>>> the cluster ids to estimate cluster robust standard errors.
>>>>
>>>> In the present case, where I have pooled (i.e., appended, as in
>>>> Stata) the household cross-section data from 20 different
>>>> countries, will it be right to still use the variable
>>>> containing the cluster ids to estimate the cluster robust
>>>> standard errors? Note that now the cluster ids will be for
>>>> all 20 countries.
>>>
>>> This is problematic.  The consistency of the cluster-robust covariance
>>> estimator is asymptotic in the number of clusters, and 20 isn't very far
>>> on the way to infinity.  Clustering on country is probably not a great
>>> idea.
>>>
>>> An alternative is to cluster on household ID and to use country dummies
>>> when you pool the data.  This would allow for arbitrary within-household
>>> correlation (via clustering on household ID) and invariant
>>> within-country correlation (via the country dummies).
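>>>
>>> A minimal sketch of that specification, assuming hypothetical variable
>>> names (y, x1, x2, country, hhid) and assuming that household IDs may
>>> repeat across countries, so that a pooled ID is built first:
>>>
>>> ***********************************************************
>>> * make household IDs unique across the pooled sample
>>> egen hh_pooled = group(country hhid)
>>> * country dummies via i.country; cluster-robust SEs by household
>>> regress y x1 x2 i.country, vce(cluster hh_pooled)
>>> ***********************************************************
>>>
>>> The same -egen ... group()- step would apply to any other clustering
>>> variable whose codes restart within each country.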
>>>
>>> HTH,
>>> Mark
>>>
>>>> I will appreciate your help.
>>>>
>>>> Thank you very much
>>>>
>>>> Gordon
>>>>
>>>> --
