Re: st: Poststratification weighting, subpop, and missing values

From	Steve Samuels <[email protected]>
To	[email protected]
Subject	Re: st: Poststratification weighting, subpop, and missing values
Date	Thu, 27 Sep 2012 10:03:54 -0400

> 
> 3. I use the clause "if !missing(y)" above, rather than "if y ~=.", because
> the latter would not capture missing values like ".a".


This seemed like a slick idea at 5:00 am, but Nick Cox privately reminded me of
a far better one to accomplish the same thing:

"Tony Lachenbruch pointed out in 1992 that -if y < .- saves a character
on -if y != .- or -if y ~= .- and the tip gained extra force when .a
... .z were introduced."

STB-9   ip2 . . . . . . . . . . . . . . . . . . . . . . .  A keyboard shortcut
       . . . . . . . . . . . . . . . . . . . . . . . . . .  P. A. Lachenbruch
       9/92    p.9; STB Reprints Vol 2, p.46                    (no commands)
       keyboard shortcut to indicate nonmissing values


Using "if y<." saves eight keystrokes!
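
To see how the three conditions differ, here is a minimal sketch to run from a do-file. The four-observation dataset is made up for illustration and is not from the thread:

***********CODE STARTS***************
clear
input y
.
.a
1
3
end
count if y != .        // 3: .a is not equal to ., so it is counted as nonmissing
count if y < .         // 2: every nonmissing number sorts below .
count if !missing(y)   // 2: same result as y < .
************CODE ENDS********************

Stata orders numeric missing values as . < .a < ... < .z, all of them above any real number, so "y < ." is true only for nonmissing values.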

Steve


Ricky Ubee:

You saw an apparently paradoxical phenomenon: when you used a subpop()
option to exclude observations with missing values of your analysis variable, 
the weighted population count and the number of observations reported by -svy: total-
increased, and the standard error also increased.

This phenomenon is actually proper behavior. It has nothing to do
with post-stratification. It has more to do with the difference between
using an -if- qualifier and a subpop() option to subset analyses. Here is a
plain example.

. ***********CODE STARTS***************
.  input y

           y
1.   .
2.   1
3.   3
4.   5
5.  end
. svyset _n 
[ results omitted]
.  svy: total y                 // (1)  Ignore missing y

Number of strata =       1          Number of obs    =       3
Number of PSUs   =       3          Population size  =       3
                                  Design df        =       2
--------------------------------------------------------------
           |             Linearized
           |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         y |          9   3.464102     -5.904826    23.90483
--------------------------------------------------------------

.  svy: total y if !missing(y)  // (2) -if- expression

Number of strata =       1          Number of obs    =       3
Number of PSUs   =       3          Population size  =       3
                                  Design df        =       2
--------------------------------------------------------------
           |             Linearized
           |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         y |          9   3.464102     -5.904826    23.90483
--------------------------------------------------------------

.  svy, subpop(if !missing(y)): total y // (3) subpop() option

Number of strata =       1          Number of obs    =       4
Number of PSUs   =       4          Population size  =       4
                                  Subpop. no. obs  =       3
                                  Subpop. size     =       3
                                  Design df        =       3
--------------------------------------------------------------
           |             Linearized
           |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         y |          9   4.434712     -5.113231    23.11323
--------------------------------------------------------------

. ************CODE ENDS********************
.

In (1) & (2) the estimation results are identical, and the (weighted)
population and observation counts are equal to 3, the subpopulation
size. In (3), the standard error is larger and the population and
observation counts are equal to the total sample size: 4.

In (1), Stata drops any observation whose analysis variable is missing.
The same thing happens in (2), where the -if- qualifier drops every
observation that is not in the subpopulation.

In (3), the subpop() option tells Stata to keep observations *not* in
the subpopulation when it computes standard errors: they still count
toward the number of PSUs and contribute zero to the total. Thus the
entire sample contributes to the analysis. For details, see any sampling
text, e.g. Levy & Lemeshow (2008).
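
One way to check this on the toy data above is to set y to zero outside the subpopulation and total it over the full sample. This is only a sketch: the variable names insub and y0 are made up for the illustration, and the hand computation assumes the unweighted, no-fpc design set by -svyset _n-. Run it from a do-file.

***********CODE STARTS***************
clear
input y
.
1
3
5
end
svyset _n
generate byte insub = !missing(y)   // 1 = in subpopulation, 0 = not
generate y0 = cond(insub, y, 0)     // zero contribution outside the subpopulation
svy: total y0                       // full sample of 4, zeros included
svy, subpop(insub): total y         // same total and standard error as (3)
* hand check of the linearized standard errors: sqrt(n/(n-1) * sum of squared deviations)
display sqrt(3/2 * ((1-3)^2 + (3-3)^2 + (5-3)^2))
display sqrt(4/3 * ((0-2.25)^2 + (1-2.25)^2 + (3-2.25)^2 + (5-2.25)^2))
************CODE ENDS********************

The two -display- lines should give about 3.4641 and 4.4347, matching the standard errors reported in (1)-(2) and in (3).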


Notes:

1.  I've never seen a recommendation to consider observations with non-missing
values as a subpopulation. The focus is more on non-response bias, and possible 
solutions include non-response weighting and imputation (though not for the outcome).

2. Combining subpopulations with post-strata and ordinary strata
can lead to bad results. Stratified & post-stratified proportions are
designed to match those of the entire population, and may not apply to
the subpopulation. See Levy & Lemeshow (2008), Section 6.4, p. 148.

3. I use the clause "if !missing(y)" above, rather than "if y ~=.", because
the latter would not capture missing values like ".a".


Reference: Levy, Paul S., and Stanley Lemeshow. 2008. Sampling of Populations: Methods and Applications. Wiley Series in Survey Methodology. Hoboken, NJ: Wiley.


Steve

> On Sep 26, 2012, at 9:25 AM, <[email protected]> <[email protected]> wrote:
> 
> Hi everyone,
> I'm currently working on analyzing the results of a survey and have run into some strange results when using poststratification weights and the subpop modifier.  An example is shown below, where we're simply totaling 2011 sales.  The flag variable indicates the subpopulation we're interested in.  When only limiting the population by flag, the command calculates the total over 2,624 PSUs, while when we try and further limit the population to those with flag equal to one and where total sales is not missing, it calculates over 2,639 PSUs.  In the second command, STATA  seems to be including the 15 missing values in its calculations.   Also, the total for the more limited subpopulation is lower, which does not coincide with what we expect to happen when removing missing values and its effect on the background calculation of the adjusted weight.
> 
> Could someone shed some light on why this is happening?
> 
> Thank you,
> Ricky Ubee
> 
> 
> 
> 
> . svyset uniqueID [pweight=weight_prop], strata(strata2) singleunit(scaled) poststrata(type2) postweight(postwt4) fpc(N)
> 
>    pweight: weight_prop
>        VCE: linearized
> Poststrata: type2
> Postweight: postwt4
> Single unit: scaled
>   Strata 1: strata2
>       SU 1: uniqueID
>      FPC 1: N
> 
> 
> . svy, subpop(if flag==1): total TOT_SALES_11
> (running total on estimation sample)
> 
> Survey: Total estimation
> 
> Number of strata =      26          Number of obs    =    2624
> Number of PSUs   =    2624          Population size  =   23794
> N. of poststrata =      16          Subpop. no. obs  =     652
>                                  Subpop. size     = 5245.94
>                                  Design df        =    2598
> 
> --------------------------------------------------------------
>              |             Linearized
>              |      Total   Std. Err.     [95% Conf. Interval]
> -------------+------------------------------------------------
> TOT_SALES_11 |   2.20e+12   2.77e+11      1.65e+12    2.74e+12
> --------------------------------------------------------------
> Note: 2 strata omitted because they contain no subpopulation
>    members.
> 
> . svy, subpop(if flag==1 & TOT_SALES_11~=.): total TOT_SALES_11
> (running total on estimation sample)
> 
> Survey: Total estimation
> 
> Number of strata =      26          Number of obs    =    2639
> Number of PSUs   =    2639          Population size  =   23794
> N. of poststrata =      16          Subpop. no. obs  =     652
>                                  Subpop. size     = 5222.38
>                                  Design df        =    2613
> 
> --------------------------------------------------------------
>              |             Linearized
>              |      Total   Std. Err.     [95% Conf. Interval]
> -------------+------------------------------------------------
> TOT_SALES_11 |   2.18e+12   2.76e+11      1.64e+12    2.72e+12
> --------------------------------------------------------------
> Note: 2 strata omitted because they contain no subpopulation
>    members.
> 
> 	  
> . count if flag==1 & TOT_SALES_11==.
> 15
> 

