Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | francesco manaresi <manaresi@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | st: Svy poststratification VS Pweighting |
Date | Mon, 21 Jun 2010 17:42:18 +0200 |
I've seen several questions on poststratification on Statalist, but I would like to ask for some clarification on the estimation of standard errors. Thank you for your kindness and availability.

I have a sample of firms that have (supposedly) been drawn at random from a reference population, and I would like to poststratify on two observable characteristics for which all population cross-tabulations are available.

A first strategy is simply to create poststrata and postweight variables to use with svyset. This works fine, but some (mainly user-written) commands do not support the "svy:" prefix (I am particularly interested in matching estimators). In those cases I would like to use pweights.

I've seen this answer
http://www.stata.com/statalist/archive/2008-11/msg00152.html
(and several others), which suggests using N_h/n_h as the pweight (where N_h is the total number of observations in the population from stratum h, and n_h is the total number of observations in the sample from stratum h). This is actually correct, because it corresponds to the inverse of the probability of being selected given that a unit belongs to a specific stratum (this can be proved by applying Bayes' rule).

However, the standard errors differ dramatically: they are much larger with the latter method than with the former.

The question is: which one should I use? And if the answer is "Stata's poststratification (poststrata and postweight)", how can I implement it in commands that do not support the "svy:" prefix?

As an example, I tried a simple simulation with a fictitious sample drawn from a fictitious population. I report results for the share of firms located in the northern part of Italy in four cases:

1- for the real population
2- for the unweighted sample
3- for the weighted sample, using "poststrata" and "postweight"
4- for the weighted sample, using the "N_h / n_h" formula

You can see that the point estimate is the same in cases 3 and 4 (obviously), but the standard error is much larger in case 4 than in case 3:

1- All Population

. mean nord

Mean estimation                   Number of obs   =      17796

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        nord |   .5057316   .0037479      .4983853    .5130779
--------------------------------------------------------------

2- Unweighted Sample

. mean nord if sample ==1

Mean estimation                   Number of obs   =       2531

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        nord |   .8411695   .0072669      .8269198    .8554192
--------------------------------------------------------------

3- Weighted Sample with Stata svy poststratification

. svy: mean nord if sample ==1
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =    2531
Number of PSUs   =    2531          Population size  =   17796
N. of poststrata =       4          Design df        =    2530

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        nord |   .5057316   5.05e-17      .5057316    .5057316
--------------------------------------------------------------

4- Weighted Sample, with N_h/n_h pweights

. mean nord if sample ==1 [pweight=pesobis]

Mean estimation                   Number of obs   =       2531

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        nord |   .5057316   .0147696      .4767698    .5346934
--------------------------------------------------------------
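
For reference, the Bayes' rule argument behind the N_h/n_h weight, and the reason the two weighted point estimates coincide, can be written out as a short sketch. The notation here is my own assumption and not taken from the linked answer: S is the event that a firm is selected, N and n are the total population and sample sizes, P(h | S) is taken to be the observed sample share n_h/n, and \bar{y}_h is the sample mean of the outcome in stratum h. It only concerns the weights and the point estimate; it says nothing about which standard error is appropriate.

```latex
\begin{align*}
% Selection probability within stratum h, by Bayes' rule
P(S \mid h) &= \frac{P(h \mid S)\,P(S)}{P(h)}
             = \frac{(n_h/n)\,(n/N)}{N_h/N}
             = \frac{n_h}{N_h},
\qquad\text{so}\qquad
w_h = \frac{1}{P(S \mid h)} = \frac{N_h}{n_h}. \\[6pt]
% Weighted mean with w_h equals the poststratified mean
\bar{y}_w &= \frac{\sum_h \sum_{i \in h} w_h\, y_i}{\sum_h \sum_{i \in h} w_h}
           = \frac{\sum_h (N_h/n_h)\, n_h\, \bar{y}_h}{\sum_h (N_h/n_h)\, n_h}
           = \sum_h \frac{N_h}{N}\, \bar{y}_h .
\end{align*}
```

The last expression is the usual poststratified mean, which is consistent with cases 3 and 4 reporting the same point estimate (.5057316) while differing only in their standard errors.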