
From: "Michael I. Lichter" <mlichter@buffalo.edu>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Analyzing a subpopulation in Stata 10.1
Date: Mon, 29 Jun 2009 14:47:22 -0400

Jeff,

Jeff Pitblado, StataCorp LP wrote:

Michael simulated a simple dataset to illustrate that the
poststratification adjustment is dependent upon the estimation sample.

    femV1  -- identifies women in the dataset
    everV1 -- identifies women who have ever given birth
    everV2 -- identifies women who have ever given birth, but with some
              values missing at random
    native -- identifies native status
    postwt -- population size for native status (3000 natives, 1000
              immigrants)

The survey characteristics are thus:

    . svyset, poststrata(native) postweight(postwt)

Here are some partial tabulations of Michael's data:

    . tab femV1 everV1, missing

               |              everV1
         femV1 |      0-No      1-Yes          . |     Total
    -----------+---------------------------------+----------
          0-No |         0          0         90 |        90
         1-Yes |        55         55          0 |       110
    -----------+---------------------------------+----------
         Total |        55         55         90 |       200

    . tab femV1 everV2, missing

               |              everV2
         femV1 |      0-No      1-Yes          . |     Total
    -----------+---------------------------------+----------
          0-No |         0          0         90 |        90
         1-Yes |        44         50         16 |       110
    -----------+---------------------------------+----------
         Total |        44         50        106 |       200

    . tab native

         native |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           0-No |        101       50.50       50.50
          1-Yes |         99       49.50      100.00
    ------------+-----------------------------------
          Total |        200      100.00

Thus, if the estimation sample is the entire dataset, we expect the
adjusted sampling weights to be

     9.9 == 1000/101    for non-native individuals
    30.3 == 3000/99     for native individuals

I generated a variable called -pw0- to contain the values.

    . gen pw0 = cond(native, 3000/99, 1000/101)

Michael first tabulates -femV1-, to get an estimate of the number of
women and men in the population of interest:

    . // [T1] femV1 only
    . svy: tab femV1, count format(%10.0f) obs
    (running tabulate on estimation sample)

    Number of strata   =         1         Number of obs      =       200
    Number of PSUs     =       200         Population size    =      4000
    N. of poststrata   =         2         Design df          =       199

    ----------------------------------
        femV1 |      count         obs
    ----------+-----------------------
         0-No |       1626          90
        1-Yes |       2374         110
              |
        Total |       4000         200
    ----------------------------------
      Key:  count  =  counts
            obs    =  number of observations

Here is essentially how these values are computed:

    . tab femV1 pw0

               |          pw0
         femV1 |   9.90099   30.30303 |     Total
    -----------+----------------------+----------
          0-No |        54         36 |        90
         1-Yes |        47         63 |       110
    -----------+----------------------+----------
         Total |       101         99 |       200

    . di 54*9.9 + 36*30.3
    1625.4

    . di 47*9.9 + 63*30.3
    2374.2

Then Michael computes a tabulation of -everV2-, and notes that the
numbers do not make any sense:

    . // [T2] ever2 only
    . svy: tab everV2, count format(%10.0f) obs
    (running tabulate on estimation sample)

    Number of strata   =         1         Number of obs      =        94
    Number of PSUs     =        94         Population size    =      4000
    N. of poststrata   =         2         Design df          =        93

    ----------------------------------
       everV2 |      count         obs
    ----------+-----------------------
         0-No |       1821          44
        1-Yes |       2179          50
              |
        Total |       4000          94
    ----------------------------------
      Key:  count  =  counts
            obs    =  number of observations

Note that the number of observations went from 200 to 94; this is due
to the missing values in -everV2-: 90 missing values for men and 16
more for the additional random missing values.

Since the estimation sample changed, we need to generate a new set of
adjusted sampling weights.

    . tab native if !missing(everV2)

         native |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           0-No |         36       38.30       38.30
          1-Yes |         58       61.70      100.00
    ------------+-----------------------------------
          Total |         94      100.00

Thus we expect the adjusted sampling weights to be

    27.8 == 1000/36    for non-native individuals
    51.7 == 3000/58    for native individuals

I generated a variable called -pw1- to contain the values.

    . gen pw1 = cond(native, 3000/58, 1000/36)

So the above counts come from the following calculation:

    . tab everV2 pw1

               |          pw1
        everV2 |  27.77778   51.72414 |     Total
    -----------+----------------------+----------
          0-No |        19         25 |        44
         1-Yes |        17         33 |        50
    -----------+----------------------+----------
         Total |        36         58 |        94

    . di 19*27.8 + 25*51.7
    1820.7

    . di 17*27.8 + 33*51.7
    2178.7

Michael definitely has a point that these numbers make no sense;
however, the proper analysis of -everV2- (and -everV1- for that matter)
is a subpopulation analysis of women. This is Michael's third
tabulation:

    . // [T3] with subpop
    . svy, subpop(femV1): tab everV2, count format(%10.0f) obs
    (running tabulate on estimation sample)

    Number of strata   =         1         Number of obs      =       184
    Number of PSUs     =       184         Population size    =      4000
    N. of poststrata   =         2         Subpop. no. of obs =        94
                                           Subpop. size       = 2251.0638
                                           Design df          =       183

    ----------------------------------
       everV2 |      count         obs
    ----------+-----------------------
         0-No |       1009          44
        1-Yes |       1242          50
              |
        Total |       2251          94
    ----------------------------------
      Key:  count  =  counts
            obs    =  number of observations

Note that the number of observations in this analysis is still not 200.
This is because of the random 16 women whose -everV2- value was set to
missing. Thus we need to generate yet another set of adjusted sampling
weights. Here is a tabulation of -native- status for the estimation
sample:

    . tab native if !missing(everV2) | !femV1

         native |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           0-No |         90       48.91       48.91
          1-Yes |         94       51.09      100.00
    ------------+-----------------------------------
          Total |        184      100.00

Thus we expect the adjusted sampling weights to be

    11.1 == 1000/90    for non-native individuals
    31.9 == 3000/94    for native individuals

I generated a variable called -pw2- to contain the values.

    . gen pw2 = cond(native, 3000/94, 1000/90)

And the above counts are computed via

    . tab everV2 pw2 if femV1

               |          pw2
        everV2 |  11.11111   31.91489 |     Total
    -----------+----------------------+----------
          0-No |        19         25 |        44
         1-Yes |        17         33 |        50
    -----------+----------------------+----------
         Total |        36         58 |        94

    . di 19*11.1 + 25*31.9
    1008.4

    . di 17*11.1 + 33*31.9
    1241.4

Finally, Michael repeats the above subpopulation estimation with the
-missing- option:

    . // [T4] with missing
    . svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss
    (running tabulate on estimation sample)

    Number of strata   =         1         Number of obs      =       200
    Number of PSUs     =       200         Population size    =      4000
    N. of poststrata   =         2         Subpop. no. of obs =       110
                                           Subpop. size       = 2374.4374
                                           Design df          =       199

    ----------------------------------
       everV2 |      count         obs
    ----------+-----------------------
         0-No |        946          44
        1-Yes |       1168          50
            . |        260          16
              |
        Total |       2374         110
    ----------------------------------
      Key:  count  =  counts
            obs    =  number of observations

We are back to the full estimation sample, so our original adjusted
sampling weights apply:

    . tab everV2 pw0 if femV1, miss

               |          pw0
        everV2 |   9.90099   30.30303 |     Total
    -----------+----------------------+----------
          0-No |        19         25 |        44
         1-Yes |        17         33 |        50
             . |        11          5 |        16
    -----------+----------------------+----------
         Total |        47         63 |       110

    . di 19*9.9 + 25*30.3
    945.6

    . di 17*9.9 + 33*30.3
    1168.2

    . di 11*9.9 + 5*30.3
    260.4

Michael concluded his posting with:

    So, Stata does adjust the subpopulation weights, but it doesn't
    adjust them to the subpopulation size. What precisely is it doing?
    I wish I knew. It seems to me that adjusting to the full
    subpopulation size is the correct thing to do, but maybe I'm
    missing something.

Stata applies the poststratification adjustment according to the
estimation sample. The sampling weights are adjusted to sum to the
corresponding poststratum sizes within the estimation sample.

Michael could -svyset- the adjusted sampling weights assuming the
entire dataset is the estimation sample. This will even give more
meaningful results for T2, but he will lose some of the efficiency
gained by knowing which poststratum an observation belongs to.

Here are the results of Michael's 4 analyses if we -svyset- the -pw0-
adjusted sampling weights:

    . svyset [pw=pw0]

          pweight: pw0
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>

    . // [T1] femV1 only
    . svy: tab femV1, count format(%10.0f) obs
    (running tabulate on estimation sample)

    Number of strata   =         1         Number of obs      =       200
    Number of PSUs     =       200         Population size    =      4000
                                           Design df          =       199

    ----------------------------------
        femV1 |      count         obs
    ----------+-----------------------
         0-No |       1626          90
        1-Yes |       2374         110
              |
        Total |       4000         200
    ----------------------------------
      Key:  count  =  weighted counts
            obs    =  number of observations

    . // [T2] ever2 only
    . svy: tab everV2, count format(%10.0f) obs
    (running tabulate on estimation sample)

    Number of strata   =         1         Number of obs      =        94
    Number of PSUs     =        94         Population size    = 2114.0114
                                           Design df          =        93

    ----------------------------------
       everV2 |      count         obs
    ----------+-----------------------
         0-No |        946          44
        1-Yes |       1168          50
              |
        Total |       2114          94
    ----------------------------------
      Key:  count  =  weighted counts
            obs    =  number of observations

    . // [T3] with subpop
    . svy, subpop(femV1): tab everV2, count format(%10.0f) obs
    (running tabulate on estimation sample)

    Number of strata   =         1         Number of obs      =       184
    Number of PSUs     =       184         Population size    =  3739.574
                                           Subpop. no. of obs =        94
                                           Subpop. size       = 2114.0114
                                           Design df          =       183

    ----------------------------------
       everV2 |      count         obs
    ----------+-----------------------
         0-No |        946          44
        1-Yes |       1168          50
              |
        Total |       2114          94
    ----------------------------------
      Key:  count  =  weighted counts
            obs    =  number of observations

    . // [T4] with missing
    . svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss
    (running tabulate on estimation sample)

    Number of strata   =         1         Number of obs      =       200
    Number of PSUs     =       200         Population size    =      4000
                                           Subpop. no. of obs =       110
                                           Subpop. size       = 2374.4374
                                           Design df          =       199

    ----------------------------------
       everV2 |      count         obs
    ----------+-----------------------
         0-No |        946          44
        1-Yes |       1168          50
            . |        260          16
              |
        Total |       2374         110
    ----------------------------------
      Key:  count  =  weighted counts
            obs    =  number of observations

--Jeff
jpitblado@stata.com

This is kind of long, but I hope that some folks, particularly those
with expertise on poststratification and people from StataCorp, will
hear me out.

Fundamentally, Figen's question is about how Stata handles missing
values under poststratification. It's one I don't know the answer to,
but that perhaps somebody from StataCorp could help answer.

To illustrate the problem, I've included a program and edited output.
For those not wanting to do too much scanning, I'll summarize what it
does and shows.

I create a fictional dataset of men and women (femV1) who are randomly
assigned to be either native or immigrant status (native; about 50% are
native), and among the women I've created an ever-given-birth (everV1)
variable (about 55% have). Although the sample is about 50% native/50%
immigrant, the hypothetical population is 75% native and I've created a
poststratification weight to deal with that. The everV1 variable is by
definition missing for all males, but I also created a new variable
everV2 that is missing at random for 15% of females.

The first table T1 below shows that there are, after weighting, 2374
females (110 obs) and 1626 males (90 obs) in the population of 4000 (200
obs). If I tabulate everV2 only (T2), without specifying a
subpopulation, we learn that there are 2179 (50) people who have given
birth and 1821 (44) who have not. Since there are only 2374 females and
only 55% of them have given birth, 2179 is clearly too big a number. Of
course, Stata doesn't know that some of the cases are missing by
definition while others are missing at random; *it has simply reweighted
the sample to the full population size.*

Now, if I tabulate everV2 for the subpopulation of females (T3), it
shows that 1242 (50) have ever had a child and 1009 (44) have not, for a
total of 2251 (94). Obviously, 2251 != 2374. *Why hasn't Stata adjusted
the weights so that they add up to the full subpopulation size?* I don't
know.

If I repeat T3 with the "missing" option (T4), I get different results.
These are the same results that I would get if I were using a static
poststratification weight: 1168 (50) yeses, 946 (44) nos, and 260 (16)
missing, adding up to the subpopulation size of 2374 (110). (Note that
this is the same as including a "if ! missing(everV2)" in the subpop()
option.) This is probably better than the seemingly arbitrary result I
get in T2, but I'd really like at least the option for my result to be
adjusted up to the subpopulation size.

So, Stata does adjust the subpopulation weights, but it doesn't adjust
them to the subpopulation size. What precisely is it doing? I wish I
knew. It seems to me that adjusting to the full subpopulation size is
the correct thing to do, but maybe I'm missing something.

Of course, Figen isn't calculating counts, he's calculating proportions.
Nevertheless, the size of errors and proportions depends on how Stata is
counting things internally.

Does this make sense? Is Stata doing the right thing? (And what *is* it
doing in T3?)

-----------------------------------------------------------------------
clear
set obs 200
set seed 06272009
gen byte femV1 = _n <= 110 // pop, 55% female
gen byte everV1 = (uniform() < .55) if (femV1==1) // females only
clonevar everV2 = everV1
replace everV2 = . if (uniform() < .15) // add missing values
gen byte native = (uniform() <= .5) // about 50% native in sample
label define Lyes01 0 "0-No" 1 "1-Yes"
label val femV1 native everV1 everV2 Lyes01
gen postwt = cond(native, 3000, 1000) // 75% native in population
svyset, poststrata(native) postweight(postwt)
svy: tab femV1, count format(%10.0f) obs // [T1] femV1 only
svy: tab everV2, count format(%10.0f) obs // [T2] ever2 only
svy, subpop(femV1): tab everV2, count format(%10.0f) obs // [T3] with subpop
svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss // [T4] with missing
-----------------------------------------------------------------------

. svy: tab femV1, count format(%10.0f) obs // [T1] femV1 only
(running tabulate on estimation sample)

Number of strata   =         1         Number of obs      =       200
Number of PSUs     =       200         Population size    =      4000
N. of poststrata   =         2         Design df          =       199

----------------------------------
    femV1 |      count         obs
----------+-----------------------
     0-No |       1626          90
    1-Yes |       2374         110
          |
    Total |       4000         200
----------------------------------
  Key:  count  =  counts
        obs    =  number of observations

. svy: tab everV2, count format(%10.0f) obs // [T2] everV2 only
(running tabulate on estimation sample)

Number of strata   =         1         Number of obs      =        94
Number of PSUs     =        94         Population size    =      4000
N. of poststrata   =         2         Design df          =        93

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |       1821          44
    1-Yes |       2179          50
          |
    Total |       4000          94
----------------------------------
  Key:  count  =  counts
        obs    =  number of observations

. svy, subpop(femV1): tab everV2, count format(%10.0f) obs // [T3] with subpop
(running tabulate on estimation sample)

Number of strata   =         1         Number of obs      =       184
Number of PSUs     =       184         Population size    =      4000
N. of poststrata   =         2         Subpop. no. of obs =        94
                                       Subpop. size       = 2251.0638
                                       Design df          =       183

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |       1009          44
    1-Yes |       1242          50
          |
    Total |       2251          94
----------------------------------
  Key:  count  =  counts
        obs    =  number of observations

. svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss // [T4] with miss too
(running tabulate on estimation sample)

Number of strata   =         1         Number of obs      =       200
Number of PSUs     =       200         Population size    =      4000
N. of poststrata   =         2         Subpop. no. of obs =       110
                                       Subpop. size       = 2374.4374
                                       Design df          =       199

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |        946          44
    1-Yes |       1168          50
        . |        260          16
          |
    Total |       2374         110
----------------------------------
  Key:  count  =  counts
        obs    =  number of observations

-----------------------------------------------------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
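
For anyone who wants to check the weight arithmetic above without
Stata, here is a minimal sketch in Python (not Stata). It assumes only
what Jeff describes: within each estimation sample, the adjusted weight
for a poststratum is its population size divided by the number of
estimation-sample observations in that poststratum. The cell counts are
copied from the tabulations in this thread; the names POST_SIZE,
adjusted_weights, and weighted_count are made up for illustration and
are not anything Stata provides.

```python
# Poststratum population sizes (the postwt values in the thread).
POST_SIZE = {"immigrant": 1000, "native": 3000}

# Observation counts by poststratum, copied from the quoted tabulations.
WOMEN = {
    "no":      {"immigrant": 19, "native": 25},   # everV2 == 0
    "yes":     {"immigrant": 17, "native": 33},   # everV2 == 1
    "missing": {"immigrant": 11, "native": 5},    # everV2 missing at random
}
MEN = {"immigrant": 54, "native": 36}             # everV2 missing by definition


def stratum_totals(*cells):
    """Add up observation counts per poststratum over the given cells."""
    return {s: sum(cell[s] for cell in cells) for s in POST_SIZE}


def adjusted_weights(sample_counts):
    """Weight = poststratum size / estimation-sample obs in that stratum."""
    return {s: POST_SIZE[s] / n for s, n in sample_counts.items()}


def weighted_count(cell, weights):
    """Weighted count for one table cell."""
    return sum(cell[s] * weights[s] for s in cell)


# T2: estimation sample = observations with nonmissing everV2 (94 obs).
w_t2 = adjusted_weights(stratum_totals(WOMEN["no"], WOMEN["yes"]))

# T3: estimation sample = men plus women with nonmissing everV2 (184 obs).
w_t3 = adjusted_weights(stratum_totals(MEN, WOMEN["no"], WOMEN["yes"]))

# T4: estimation sample = the whole dataset (200 obs).
w_t4 = adjusted_weights(
    stratum_totals(MEN, WOMEN["no"], WOMEN["yes"], WOMEN["missing"])
)

for label, w in (("T2", w_t2), ("T3", w_t3), ("T4", w_t4)):
    counts = {k: round(weighted_count(WOMEN[k], w)) for k in ("no", "yes")}
    print(label, counts)
```

Only the denominators change between T2, T3, and T4; the women's cell
counts stay the same, which is why the three tables disagree.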

--
Michael I. Lichter, Ph.D. <mlichter@buffalo.edu>
Research Assistant Professor & NRSA Fellow
UB Department of Family Medicine / Primary Care Research Institute
UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
Office: CC 126 / Phone: 716-898-4751 / FAX: 716-898-3536

