

From: Steve Samuels <sjsamuels@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Poststratification weighting, subpop, and missing values
Date: Thu, 27 Sep 2012 05:21:57 -0400

Oops.. I dropped some text in the first version.

Steve

Ricky Ubee:

You saw an apparently paradoxical phenomenon: when you used a subpop() option to exclude observations with missing values of your analysis variable, the weighted population count and the number of observations reported by -svy: total- increased, and the standard error also increased.

This phenomenon is actually proper behavior. It has nothing to do with post-stratification. It has more to do with the difference between using an -if- option and a subpop() option to subset analyses. Here is a plain example.

. ***********CODE STARTS***************
. input y

             y
  1. .
  2. 1
  3. 3
  4. 5
  5. end

. svyset _n
  [results omitted]

. svy: total y    // (1) Ignore missing y

Number of strata =       1          Number of obs    =       3
Number of PSUs   =       3          Population size  =       3
                                    Design df        =       2

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           y |          9   3.464102     -5.904826    23.90483
--------------------------------------------------------------

. svy: total y if !missing(y)    // (2) -if- expression

Number of strata =       1          Number of obs    =       3
Number of PSUs   =       3          Population size  =       3
                                    Design df        =       2

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           y |          9   3.464102     -5.904826    23.90483
--------------------------------------------------------------

. svy, subpop(if !missing(y)): total y    // (3)

Number of strata =       1          Number of obs    =       4
Number of PSUs   =       4          Population size  =       4
                                    Subpop. no. obs  =       3
                                    Subpop. size     =       3
                                    Design df        =       3

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           y |          9   4.434712     -5.113231    23.11323
--------------------------------------------------------------
************CODE ENDS********************

In (1) & (2) the estimation results are identical, and the (weighted) population and observation counts are equal to 3, the subpopulation size. In (3), the standard error is larger and the population and observation counts are equal to the total sample size: 4.

In (1), if your analysis variable is missing, Stata ignores the observation. This also happens in (2), which ignores observations not in the subpopulation. In (3), the subpop() option tells Stata to consider observations *not* in the subpopulation for purposes of computing standard errors. Thus the entire sample contributes to the analysis. For details, see any sampling text, e.g. Levy & Lemeshow (2008).

Notes:

1. I've never seen a recommendation to consider observations with non-missing values as a subpopulation. The focus is more on non-response bias, and possible solutions include non-response weighting and imputation (though not for the outcome).

2. Combining subpopulations with post-strata and ordinary strata can lead to bad results. Stratified & post-stratified proportions are designed to match those of the entire population, and may not apply to the subpopulation. See Levy & Lemeshow (2008), Section 6.4, p. 148.

3. I use the clause "if !missing(y)" above, rather than "if y ~=.", because the latter would not screen out extended missing values like ".a".

Reference:
Levy, Paul S., and Stanley Lemeshow. 2008. Sampling of Populations: Methods and Applications. Wiley Series in Survey Methodology. Hoboken, NJ: Wiley.

Steve

On Sep 26, 2012, at 9:25 AM, <Ravinder.Ubee@usitc.gov> wrote:

> Hi everyone,
> I'm currently working on analyzing the results of a survey and have run into some strange results when using poststratification weights and the subpop modifier. An example is shown below, where we're simply totaling 2011 sales. The flag variable indicates the subpopulation we're interested in.
> When only limiting the population by flag, the command calculates the total over 2,624 PSUs, while when we try and further limit the population to those with flag equal to one and where total sales is not missing, it calculates over 2,639 PSUs. In the second command, STATA seems to be including the 15 missing values in its calculations. Also, the total for the more limited subpopulation is lower, which does not coincide with what we expect to happen when removing missing values and its effect on the background calculation of the adjusted weight.
>
> Could someone shed some light on why this is happening?
>
> Thank you,
> Ricky Ubee
>
> . svyset uniqueID [pweight=weight_prop], strata(strata2) singleunit(scaled) poststrata(type2) postweight(postwt4) fpc(N)
>
>       pweight: weight_prop
>           VCE: linearized
>    Poststrata: type2
>    Postweight: postwt4
>   Single unit: scaled
>      Strata 1: strata2
>          SU 1: uniqueID
>         FPC 1: N
>
> . svy, subpop(if flag==1): total TOT_SALES_11
> (running total on estimation sample)
>
> Survey: Total estimation
>
> Number of strata =      26          Number of obs    =    2624
> Number of PSUs   =    2624          Population size  =   23794
> N. of poststrata =      16          Subpop. no. obs  =     652
>                                     Subpop. size     = 5245.94
>                                     Design df        =    2598
>
> --------------------------------------------------------------
>              |             Linearized
>              |      Total   Std. Err.     [95% Conf. Interval]
> -------------+------------------------------------------------
> TOT_SALES_11 |   2.20e+12   2.77e+11      1.65e+12    2.74e+12
> --------------------------------------------------------------
> Note: 2 strata omitted because they contain no subpopulation
>       members.
>
> . svy, subpop(if flag==1 & TOT_SALES_11~=.): total TOT_SALES_11
> (running total on estimation sample)
>
> Survey: Total estimation
>
> Number of strata =      26          Number of obs    =    2639
> Number of PSUs   =    2639          Population size  =   23794
> N. of poststrata =      16          Subpop. no. obs  =     652
>                                     Subpop. size     = 5222.38
>                                     Design df        =    2613
>
> --------------------------------------------------------------
>              |             Linearized
>              |      Total   Std. Err.     [95% Conf. Interval]
> -------------+------------------------------------------------
> TOT_SALES_11 |   2.18e+12   2.76e+11      1.64e+12    2.72e+12
> --------------------------------------------------------------
> Note: 2 strata omitted because they contain no subpopulation
>       members.
>
> . count if flag==1 & TOT_SALES_11==.
>    15

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
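The two standard errors in the four-observation toy example (3.464102 with -if-, 4.434712 with subpop()) can be checked by hand, which may make the mechanism easier to see. The do-file below is only a sketch under stated assumptions: one stratum, weights of 1, and no finite population correction, so that the linearized variance of a total reduces to n/(n-1) times the sum of squared deviations of the per-PSU values, i.e. n times their sample variance. The names z, se_if, and se_subpop are illustrative and not part of the original post.

```stata
* Hand check of the standard errors in the toy example.
* Assumes one stratum, unit weights, no FPC, so that
*   Var(total) = n/(n-1) * sum((z_i - zbar)^2) = n * Var(z)
clear
input y
.
1
3
5
end

* (1)/(2): the observation with missing y is dropped, so n = 3 and z = y
quietly summarize y
scalar se_if = sqrt(r(N) * r(Var))
display "SE with -if- (or default): " se_if

* (3): all 4 PSUs stay in the sample; the PSU outside the
* subpopulation contributes z = 0 to the total
generate z = cond(missing(y), 0, y)
quietly summarize z
scalar se_subpop = sqrt(r(N) * r(Var))
display "SE with subpop():          " se_subpop
```

Under these assumptions the two scalars agree with the standard errors reported for (2) and (3). The estimate of the total is the same in both cases; the only change is that the PSU outside the subpopulation enters the variance calculation as a zero.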
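Note 3 in the reply, on "if !missing(y)" versus "if y ~=.", can be illustrated with a small made-up dataset. This is a sketch meant to be run as a do-file (the // comments are not accepted interactively); the data values are hypothetical and chosen only to include one extended missing value.

```stata
* Extended missing values: !missing(y) versus y ~= .
clear
input y
.
.a
1
3
end

count if y ~= .         // 3 observations: .a is not equal to . and passes
count if !missing(y)    // 2 observations: both . and .a are screened out
count if y < .          // 2 observations: every missing code sorts above all numbers
```

With a condition such as TOT_SALES_11~=. in a subpop() option, any observation coded .a through .z would therefore still be treated as a member of the subpopulation.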
