# Re: st: Analyzing a subpopulation in Stata 10.1

- From: jpitblado@stata.com (Jeff Pitblado, StataCorp LP)
- To: statalist@hsphsun2.harvard.edu
- Subject: Re: st: Analyzing a subpopulation in Stata 10.1
- Date: Mon, 29 Jun 2009 06:00:00 -0500

```
This post contains two replies: a short one to Figen's original question, and
a long one to Michael's comments/questions.

--

"Karadogan, Figen" <fo145502@ohio.edu> is estimating subpopulation proportions
using survey data containing missing values and poststratification weights.
Figen asks which of the following two calls to -svy: proportion- is reporting
the results of interest.

(1)	. svy, subpop(dmrc_dum if agerc25==1): proportion dentistrc if
> !missing(dmrc_dum, dentistrc, agerc25, edurc), over(edurc)
(output omitted)

Number of obs = 3772

(2)	. svy, subpop(dmrc_dum if agerc25==1): proportion dentistrc if
> !missing(dmrc_dum, dentistrc, agerc25), over(edurc)
(output omitted)

Number of obs = 3783

The only difference between these two commands is whether -edurc- is listed in
the -if !missing()- restriction, which determines how -svy: proportion-
handles missing values in the -edurc- variable.

In (1), all observations with missing values in -edurc- are dropped from the
estimation sample.

In (2), observations with missing values in -edurc- are only dropped from the
estimation sample if they are within the subpopulation sample or if one of
-dmrc_dum-, -dentistrc-, or -agerc25- is also missing.

There are 11 more observations in (2)'s estimation sample; these observations
contain missing values in -edurc- but are not in the subpopulation sample.

The answer to Figen's question depends on how Figen believes these
observations should be handled.  Barring any substantive reason for dropping
these observations, Figen could go with the results from (2).

--

"Michael I. Lichter" <MLichter@Buffalo.EDU> replied with a test dataset,
some comments, and a question on how Stata's -svy- commands handle
poststratification adjustments in subpopulation estimation.

I've included most of Michael's original posting at the end of this email.

Michael simulated a simple dataset to illustrate that the poststratification
adjustment is dependent upon the estimation sample.

femV1	-- identifies women in the dataset

everV1	-- identifies women who have ever-given-birth

everV2	-- identifies women who have ever-given-birth, but with some
values missing at random

native	-- identifies native status

postwt	-- population size for native status (3000 natives, 1000 immigrants)

The survey characteristics are thus:

. svyset, poststrata(native) postweight(postwt)

Here are some partial tabulations of Michael's data:

. tab femV1 everV1, missing

           |              everV1
     femV1 |      0-No      1-Yes          . |     Total
-----------+---------------------------------+----------
      0-No |         0          0         90 |        90
     1-Yes |        55         55          0 |       110
-----------+---------------------------------+----------
     Total |        55         55         90 |       200

. tab femV1 everV2, missing

           |              everV2
     femV1 |      0-No      1-Yes          . |     Total
-----------+---------------------------------+----------
      0-No |         0          0         90 |        90
     1-Yes |        44         50         16 |       110
-----------+---------------------------------+----------
     Total |        44         50        106 |       200

. tab native

     native |      Freq.     Percent        Cum.
------------+-----------------------------------
       0-No |        101       50.50       50.50
      1-Yes |         99       49.50      100.00
------------+-----------------------------------
      Total |        200      100.00

Thus, if the estimation sample is the entire dataset, we expect the adjusted
sampling weights to be

9.9 == 1000/101	for non-native individuals
30.3 == 3000/99		for     native individuals

I generated a variable called -pw0- to contain the values.

. gen pw0 = cond(native, 3000/99, 1000/101)

Michael first tabulates -femV1-, to get an estimate of the number of women and
men in the population of interest:

// [T1] femV1 only
. svy: tab femV1, count format(%10.0f) obs
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       200
Number of PSUs     =       200                  Population size    =      4000
N. of poststrata   =         2                  Design df          =       199

----------------------------------
    femV1 |      count         obs
----------+-----------------------
     0-No |       1626          90
    1-Yes |       2374         110
          |
    Total |       4000         200
----------------------------------
  Key:  count     =  counts
        obs       =  number of observations

Here is essentially how these values are computed:

. tab femV1 pw0

           |          pw0
     femV1 |   9.90099   30.30303 |     Total
-----------+----------------------+----------
      0-No |        54         36 |        90
     1-Yes |        47         63 |       110
-----------+----------------------+----------
     Total |       101         99 |       200

. di 54*9.9 + 36*30.3
1625.4

. di 47*9.9 + 63*30.3
2374.2

Then Michael computes a tabulation of -everV2-, and notes that the numbers do
not make any sense:

// [T2] ever2 only
. svy: tab everV2, count format(%10.0f) obs
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =        94
Number of PSUs     =        94                  Population size    =      4000
N. of poststrata   =         2                  Design df          =        93

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |       1821          44
    1-Yes |       2179          50
          |
    Total |       4000          94
----------------------------------
  Key:  count     =  counts
        obs       =  number of observations

Note that the number of observations went from 200 to 94; this is due to the
missing values in -everV2-: 90 for the men, plus 16 for the women whose values
were set to missing at random.  Since the estimation sample changed, we need to
generate a new set of adjusted sampling weights.

. tab native if !missing(everV2)

     native |      Freq.     Percent        Cum.
------------+-----------------------------------
       0-No |         36       38.30       38.30
      1-Yes |         58       61.70      100.00
------------+-----------------------------------
      Total |         94      100.00

Thus we expect the adjusted sampling weights to be

27.7 == 1000/36		for non-native individuals
51.7 == 3000/58		for     native individuals

I generated a variable called -pw1- to contain the values.

. gen pw1 = cond(native, 3000/58, 1000/36)

So the above counts come from the following calculation:

. tab everV2 pw1

           |          pw1
    everV2 |  27.77778   51.72414 |     Total
-----------+----------------------+----------
      0-No |        19         25 |        44
     1-Yes |        17         33 |        50
-----------+----------------------+----------
     Total |        36         58 |        94

. di 19*27.8 + 25*51.7
1820.7

. di 17*27.8 + 33*51.7
2178.7

Michael definitely has a point that these numbers make no sense; however, the
proper analysis of -everV2- (and -everV1- for that matter) is a subpopulation
analysis of women. This is Michael's third tabulation:

// [T3] with subpop
. svy, subpop(femV1): tab everV2, count format(%10.0f) obs
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       184
Number of PSUs     =       184                  Population size    =      4000
N. of poststrata   =         2                  Subpop. no. of obs =        94
                                                Subpop. size       = 2251.0638
                                                Design df          =       183

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |       1009          44
    1-Yes |       1242          50
          |
    Total |       2251          94
----------------------------------
  Key:  count     =  counts
        obs       =  number of observations

Note that the number of observations in this analysis is still not 200.  This
is because of the random 16 women whose -everV2- value was set to missing.
Thus we need to generate yet another set of adjusted sampling weights.  Here
is a tabulation of -native- status for the estimation sample:

. tab native if !missing(everV2) | !femV1

     native |      Freq.     Percent        Cum.
------------+-----------------------------------
       0-No |         90       48.91       48.91
      1-Yes |         94       51.09      100.00
------------+-----------------------------------
      Total |        184      100.00

Thus we expect the adjusted sampling weights to be

11.1 == 1000/90		for non-native individuals
31.9 == 3000/94		for     native individuals

I generated a variable called -pw2- to contain the values.

. gen pw2 = cond(native, 3000/94, 1000/90)

And the above counts are computed via

. tab everV2 pw2 if femV1

           |          pw2
    everV2 |  11.11111   31.91489 |     Total
-----------+----------------------+----------
      0-No |        19         25 |        44
     1-Yes |        17         33 |        50
-----------+----------------------+----------
     Total |        36         58 |        94

. di 19*11.1 + 25*31.9
1008.4

. di 17*11.1 + 33*31.9
1241.4

Finally, Michael repeats the above subpopulation estimation with the -missing-
option:

// [T4] with missing
. svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       200
Number of PSUs     =       200                  Population size    =      4000
N. of poststrata   =         2                  Subpop. no. of obs =       110
                                                Subpop. size       = 2374.4374
                                                Design df          =       199

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |        946          44
    1-Yes |       1168          50
        . |        260          16
          |
    Total |       2374         110
----------------------------------
  Key:  count     =  counts
        obs       =  number of observations

We are back to the full estimation sample so our original adjusted sampling
weights apply:

. tab everV2 pw0 if femV1, miss

           |          pw0
    everV2 |   9.90099   30.30303 |     Total
-----------+----------------------+----------
      0-No |        19         25 |        44
     1-Yes |        17         33 |        50
         . |        11          5 |        16
-----------+----------------------+----------
     Total |        47         63 |       110

. di 19*9.9 + 25*30.3
945.6

. di 17*9.9 + 33*30.3
1168.2

. di 11*9.9 +  5*30.3
260.4

Michael concluded his posting with:

> So, Stata does adjust the subpopulation weights, but it doesn't adjust
> them to the subpopulation size. What precisely is it doing? I wish I
> knew. It seems to me that adjusting to the full subpopulation size is
> the correct thing to do, but maybe I'm missing something.

Stata applies the poststratification adjustment according to the estimation
sample.  The sampling weights are adjusted to sum to the corresponding
poststratum sizes within the estimation sample.

Michael could instead -svyset- adjusted sampling weights computed as if the
entire dataset were the estimation sample.  This will even give more
meaningful results for T2, but he will lose some of the efficiency gained by
knowing which poststratum an observation belongs to.  Here are the results of
Michael's four analyses if we -svyset- the -pw0- adjusted sampling weights:

. svyset [pw=pw0]

      pweight: pw0
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

. // [T1] femV1 only
. svy: tab femV1, count format(%10.0f) obs
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       200
Number of PSUs     =       200                  Population size    =      4000
                                                Design df          =       199

----------------------------------
    femV1 |      count         obs
----------+-----------------------
     0-No |       1626          90
    1-Yes |       2374         110
          |
    Total |       4000         200
----------------------------------
  Key:  count     =  weighted counts
        obs       =  number of observations

. // [T2] ever2 only
. svy: tab everV2, count format(%10.0f) obs
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =        94
Number of PSUs     =        94                  Population size    = 2114.0114
                                                Design df          =        93

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |        946          44
    1-Yes |       1168          50
          |
    Total |       2114          94
----------------------------------
  Key:  count     =  weighted counts
        obs       =  number of observations

. // [T3] with subpop
. svy, subpop(femV1): tab everV2, count format(%10.0f) obs
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       184
Number of PSUs     =       184                  Population size    =  3739.574
                                                Subpop. no. of obs =        94
                                                Subpop. size       = 2114.0114
                                                Design df          =       183

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |        946          44
    1-Yes |       1168          50
          |
    Total |       2114          94
----------------------------------
  Key:  count     =  weighted counts
        obs       =  number of observations

. // [T4] with missing
. svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       200
Number of PSUs     =       200                  Population size    =      4000
                                                Subpop. no. of obs =       110
                                                Subpop. size       = 2374.4374
                                                Design df          =       199

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |        946          44
    1-Yes |       1168          50
        . |        260          16
          |
    Total |       2374         110
----------------------------------
  Key:  count     =  weighted counts
        obs       =  number of observations

--Jeff

> This is kind of long, but I hope that some folks, particularly those
> with expertise on poststratification and people from StataCorp will hear
> me out.
>
> Fundamentally, Figen's question is about how Stata handles missing
> values under poststratification. It's one I don't know the answer to,
> but that perhaps somebody from StataCorp could help answer.
>
> To illustrate the problem, I've included a program and edited output.
> For those not wanting to do too much scanning, I'll summarize what it
> does and shows.
>
> I create a fictional dataset of men and women (femV1) who are randomly
> assigned to be either native or immigrant status (native; about 50% are
> native), and among the women I've created an ever-given-birth (everV1)
> variable (about 55% have). Although the sample is about 50% native/50%
> immigrant, the hypothetical population is 75% native and I've created a
> poststratification weight to deal with that. The everV1 variable is by
> definition missing for all males, but I also created a new variable
> everV2 that is missing at random for 15% of females.
>
> The first table T1 below shows that there are, after weighting, 2374
> females (110 obs) and 1626 males (90 obs) in the population of 4000 (200
> obs). If I tabulate everV2 only (T2), without specifying a
> subpopulation, we learn that there are 2179 (50) people who have given
> birth and 1821 (44) who have not. Since there are only 2374 females and
> only 55% of them have given birth, 2179 is clearly too big a number. Of
> course, Stata doesn't know that some of the cases are missing by
> definition while others are missing at random; *it has simply reweighted
> the sample to the full population size.*
>
> Now, if I tabulate everV2 for the subpopulation of females (T3), it
> shows that 1242 (50) have ever had a child and 1009 (44) have not, for a
> total of 2251 (94). Obviously, 2251 != 2374. *Why hasn't Stata adjusted
> the weights so that they add up to the full subpopulation size?* I don't
> know.
>
> If I repeat T3 with the "missing" option (T4), I get different results.
> These are the same results that I would get if I were using a static
> poststratification weight: 1168 (50) yeses, 946 (44) nos, and 260 (16)
> missing, adding up to the subpopulation size of 2374 (110). (Note that
> this is the same as including a "if ! missing(everV2)" in the subpop()
> option.) This is probably better than the seemingly arbitrary result I
> get in T2, but I'd really like at least the option for my result to be
> adjusted up to the subpopulation size.
>
> So, Stata does adjust the subpopulation weights, but it doesn't adjust
> them to the subpopulation size. What precisely is it doing? I wish I
> knew. It seems to me that adjusting to the full subpopulation size is
> the correct thing to do, but maybe I'm missing something.
>
> Of course, Figen isn't calculating counts, he's calculating proportions.
> Nevertheless, the size of errors and proportions depends on how Stata is
> counting things internally.
>
> Does this make sense? Is Stata doing the right thing? (And what *is* it
> doing in T3?)

> -----------------------------------------------------------------------------------------
> clear
> set obs 200
> set seed 06272009
> gen byte femV1 = _n <= 110              // pop, 55% female
> gen byte everV1 = (uniform() < .55) if (femV1==1)   // females only
> clonevar everV2 = everV1
> replace everV2 = . if (uniform() < .15)    // add missing values
> gen byte native = (uniform() <= .5)    // about 50% native in sample
> label define Lyes01 0 "0-No" 1 "1-Yes"
> label val femV1 native everV1 everV2 Lyes01
> gen postwt = cond(native, 3000, 1000)  // 75% native in population
> svyset, poststrata(native) postweight(postwt)
>
> svy: tab femV1, count format(%10.0f) obs   // [T1] femV1 only
> svy: tab everV2, count format(%10.0f) obs   // [T2] ever2 only
> svy, subpop(femV1): tab everV2, count format(%10.0f) obs   // [T3] with subpop
> svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss   // [T4] with missing
> -----------------------------------------------------------------------------------------
>
> . svy: tab femV1, count format(%10.0f) obs   // [T1] femV1 only
> (running tabulate on estimation sample)
>
> Number of strata   =         1                  Number of obs      =       200
> Number of PSUs     =       200                  Population size    =      4000
> N. of poststrata   =         2                  Design df          =       199
>
> ----------------------------------
>     femV1 |      count         obs
> ----------+-----------------------
>      0-No |       1626          90
>     1-Yes |       2374         110
>           |
>     Total |       4000         200
> ----------------------------------
>   Key:  count     =  counts
>         obs       =  number of observations
>
> . svy: tab everV2, count format(%10.0f) obs   // [T2] everV2 only
> (running tabulate on estimation sample)
>
> Number of strata   =         1                  Number of obs      =        94
> Number of PSUs     =        94                  Population size    =      4000
> N. of poststrata   =         2                  Design df          =        93
>
> ----------------------------------
>    everV2 |      count         obs
> ----------+-----------------------
>      0-No |       1821          44
>     1-Yes |       2179          50
>           |
>     Total |       4000          94
> ----------------------------------
>   Key:  count     =  counts
>         obs       =  number of observations
>
> . svy, subpop(femV1): tab everV2, count format(%10.0f) obs   // [T3] with subpop
> (running tabulate on estimation sample)
>
> Number of strata   =         1                  Number of obs      =       184
> Number of PSUs     =       184                  Population size    =      4000
> N. of poststrata   =         2                  Subpop. no. of obs =        94
>                                                 Subpop. size       = 2251.0638
>                                                 Design df          =       183
>
> ----------------------------------
>    everV2 |      count         obs
> ----------+-----------------------
>      0-No |       1009          44
>     1-Yes |       1242          50
>           |
>     Total |       2251          94
> ----------------------------------
>   Key:  count     =  counts
>         obs       =  number of observations
>
> . svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss   // [T4] with miss too
> (running tabulate on estimation sample)
>
> Number of strata   =         1                  Number of obs      =       200
> Number of PSUs     =       200                  Population size    =      4000
> N. of poststrata   =         2                  Subpop. no. of obs =       110
>                                                 Subpop. size       = 2374.4374
>                                                 Design df          =       199
>
> ----------------------------------
>    everV2 |      count         obs
> ----------+-----------------------
>      0-No |        946          44
>     1-Yes |       1168          50
>         . |        260          16
>           |
>     Total |       2374         110
> ----------------------------------
>   Key:  count     =  counts
>         obs       =  number of observations
>
>
> -----------------------------------------------------------------------------------------
```
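
The adjustment described in the reply, where the weights are rescaled to sum to each poststratum size within the estimation sample, can be sketched as a short do-file. This is a minimal sketch under stated assumptions, not a description of what -svy- does internally: it assumes Michael's simulated data are in memory and that every observation has a base sampling weight of 1 (no pweight was -svyset-). The names `touse1`-`touse3`, `n_ps1`-`n_ps3`, and `pw_adj1`-`pw_adj3` are hypothetical and were introduced only for this illustration.

```stata
* Sketch only: assumes Michael's simulated dataset is in memory and that
* the base sampling weight is 1 for every observation.
* touse*, n_ps*, and pw_adj* are hypothetical names.

* Estimation samples for T1 (everyone), T2 (nonmissing everV2), and
* T3 (nonmissing everV2 among women, plus all men)
gen byte touse1 = 1
gen byte touse2 = !missing(everV2)
gen byte touse3 = !missing(everV2) | !femV1

forvalues k = 1/3 {
    * number of estimation-sample observations in each poststratum
    bysort native: egen n_ps`k' = total(touse`k')
    * adjusted weight = poststratum size / sample count in that poststratum
    gen double pw_adj`k' = postwt / n_ps`k' if touse`k'
}

* Weighted counts of everV2 among women under the T3 weights
tabulate everV2 if femV1 & touse3 [iweight = pw_adj3]
```

If these assumptions hold, `pw_adj1`, `pw_adj2`, and `pw_adj3` should agree with `pw0`, `pw1`, and `pw2` from the reply (up to storage precision) on their respective estimation samples, and the final tabulation should show the T3 counts. Keeping one indicator per estimation sample makes it explicit that the weights change only because the set of observations entering the poststratum totals changes.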