# Re: st: Analyzing a subpopulation in Stata 10.1

From: "Michael I. Lichter" <[email protected]>
To: [email protected]
Subject: Re: st: Analyzing a subpopulation in Stata 10.1
Date: Sat, 27 Jun 2009 16:25:50 -0400

This is kind of long, but I hope that some folks, particularly those with expertise on poststratification and people from StataCorp, will hear me out.
Fundamentally, Figen's question is about how Stata handles missing values under poststratification. I don't know the answer, but perhaps somebody from StataCorp could help.
To illustrate the problem, I've included a program and edited output. For those not wanting to do too much scanning, I'll summarize what it does and shows.
I create a fictional dataset of men and women (femV1) who are randomly assigned either native or immigrant status (native; about 50% are native). Among the women, I've created an ever-given-birth variable (everV1; about 55% have). Although the sample is about 50% native and 50% immigrant, the hypothetical population is 75% native, and I've created a poststratification weight to deal with that. The everV1 variable is by definition missing for all males, but I also created a new variable, everV2, that is additionally missing at random for about 15% of females.
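As a rough sketch of what that weight amounts to, the adjusted weights can be computed by hand. This assumes that, with no sampling weights, each observation's poststratified weight is the poststratum total divided by the number of sampled observations in that poststratum. The names n_ps and w_ps are made up for illustration; svyset does not create them.

```stata
* Hand-computed poststratified weights; run after the example program below.
* n_ps and w_ps are hypothetical names used only for this check.
egen n_ps = count(native), by(native)      // sampled obs in each poststratum
gen double w_ps = postwt / n_ps            // people represented by each obs
tabstat w_ps, by(native) statistics(n mean sum)   // sums are 1000 and 3000
quietly summarize w_ps if femV1 == 1
display r(sum)                             // compare with the female count in T1
```

If the assumption holds, the last line should agree with the weighted number of females that T1 reports.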
The first table T1 below shows that there are, after weighting, 2374 females (110 obs) and 1626 males (90 obs) in the population of 4000 (200 obs). If I tabulate everV2 only (T2), without specifying a subpopulation, we learn that there are 2179 (50) people who have given birth and 1821 (44) who have not. Since there are only 2374 females and only 55% of them have given birth, 2179 is clearly too big a number. Of course, Stata doesn't know that some of the cases are missing by definition while others are missing at random; *it has simply reweighted the sample to the full population size.*
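The arithmetic behind that objection, using only the rounded counts and observation numbers reported in T1 and T2, can be checked with display:

```stata
display 2374 * 0.55    // about 1306 women expected to have given birth
display 2179 + 1821    // 4000: T2's two cells sum to the full population
display 50 + 44        // 94: the observations with nonmissing everV2
```

The 94 observations with data have been weighted up to the full 4000, which is why the "yes" cell exceeds what 55% of 2374 females could produce.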
Now, if I tabulate everV2 for the subpopulation of females (T3), it shows that 1242 (50) have ever had a child and 1009 (44) have not, for a total of 2251 (94). Obviously, 2251 != 2374. *Why hasn't Stata adjusted the weights so that they add up to the full subpopulation size?* I don't know.
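One clue is in T3's header, which reports 184 observations: the 90 males plus the 94 females with nonmissing everV2. A possible check is to poststratify by hand over just those 184 observations and see whether the female weights sum to the reported subpopulation size. This is a guess at the mechanism, not a documented description of what svy does internally, and in184, n_184, and w_184 are made-up names.

```stata
* Hypothetical check: poststratify by hand over the 184 observations T3 reports.
gen byte in184 = !(femV1 == 1 & missing(everV2))   // males + females with data
count if in184                                     // T3's header suggests 184
egen n_184 = total(in184), by(native)              // obs kept per poststratum
gen double w_184 = postwt / n_184 if in184         // hand-made adjusted weight
quietly summarize w_184 if femV1 == 1
display r(sum)                                     // compare with 2251.0638
```

If the sum matches, the poststratification adjustment would appear to be made over the estimation sample rather than within the subpopulation; if it does not, this guess is wrong.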
If I repeat T3 with the "missing" option (T4), I get different results. These are the same results that I would get if I were using a static poststratification weight: 1168 (50) yeses, 946 (44) nos, and 260 (16) missing, adding up to the subpopulation size of 2374 (110). (Note that this is the same as including an "if !missing(everV2)" qualifier in the subpop() option.) This is probably better than the seemingly arbitrary result I get in T2, but I'd really like at least the option for my result to be adjusted up to the subpopulation size.
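For concreteness, here is the arithmetic for T4, plus one possible reading of "adjusted up to the subpopulation size": scaling the two nonmissing cells proportionally so that they sum to 2374. The proportional rescaling is an assumption made purely for illustration; it ignores poststratum membership and works from the rounded counts.

```stata
display 1168 + 946 + 260            // 2374: T4 cells sum to the subpopulation size
display 1168 + 946                  // 2114: weighted count with nonmissing everV2
display 1168 * 2374 / (1168 + 946)  // about 1312 yeses after proportional rescaling
display 946 * 2374 / (1168 + 946)   // about 1062 nos after proportional rescaling
```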
So, Stata does adjust the subpopulation weights, but it doesn't adjust them to the subpopulation size. What precisely is it doing? I wish I knew. It seems to me that adjusting to the full subpopulation size is the correct thing to do, but maybe I'm missing something.
Of course, Figen isn't calculating counts; he's calculating proportions. Nevertheless, the size of the errors and proportions depends on how Stata is counting things internally.
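To see how the counting carries over to proportions, the shares implied by the tables can be compared directly. The counts are rounded, so small differences in the third decimal should not be over-read. The last line is a sketch of how to request the proportions and their standard errors for the subpopulation rather than the counts.

```stata
display 1242 / (1242 + 1009)   // about 0.552: share of yeses implied by T3
display 1168 / (1168 + 946)    // about 0.553: share among nonmissing cells in T4
display 50 / 94                // about 0.532: unweighted share
svy, subpop(femV1): tab everV2, se ci obs format(%7.4f)   // proportions with SEs
```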
Does this make sense? Is Stata doing the right thing? (And what *is* it doing in T3?)

Michael

-----------------------------------------------------------------------------------------

```
clear
set obs 200
set seed 06272009
gen byte femV1 = _n <= 110                          // pop, 55% female
gen byte everV1 = (uniform() < .55) if (femV1==1)   // females only
clonevar everV2 = everV1
replace everV2 = . if (uniform() < .15)             // add missing values
gen byte native = (uniform() <= .5)                 // about 50% native in sample
label define Lyes01 0 "0-No" 1 "1-Yes"
label val femV1 native everV1 everV2 Lyes01
gen postwt = cond(native, 3000, 1000)               // 75% native in population
svyset, poststrata(native) postweight(postwt)

svy: tab femV1, count format(%10.0f) obs                        // [T1] femV1 only
svy: tab everV2, count format(%10.0f) obs                       // [T2] everV2 only
svy, subpop(femV1): tab everV2, count format(%10.0f) obs        // [T3] with subpop
svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss   // [T4] with missing
```

-----------------------------------------------------------------------------------------

```
. svy: tab femV1, count format(%10.0f) obs   // [T1] femV1 only
(running tabulate on estimation sample)
```
```
Number of strata   =         1                 Number of obs      =        200
Number of PSUs     =       200                 Population size    =       4000
N. of poststrata   =         2                 Design df          =        199

----------------------------------
    femV1 |      count         obs
----------+-----------------------
     0-No |       1626          90
    1-Yes |       2374         110
          |
    Total |       4000         200
----------------------------------
  Key:  count     =  counts
        obs       =  number of observations

. svy: tab everV2, count format(%10.0f) obs   // [T2] everV2 only
(running tabulate on estimation sample)
```
```
Number of strata   =         1                 Number of obs      =         94
Number of PSUs     =        94                 Population size    =       4000
N. of poststrata   =         2                 Design df          =         93

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |       1821          44
    1-Yes |       2179          50
          |
    Total |       4000          94
----------------------------------
  Key:  count     =  counts
        obs       =  number of observations
```
```
. svy, subpop(femV1): tab everV2, count format(%10.0f) obs // [T3] with subpop
(running tabulate on estimation sample)
```
```
Number of strata   =         1                 Number of obs      =        184
Number of PSUs     =       184                 Population size    =       4000
N. of poststrata   =         2                 Subpop. no. of obs =         94
                                               Subpop. size       =  2251.0638
                                               Design df          =        183

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |       1009          44
    1-Yes |       1242          50
          |
    Total |       2251          94
----------------------------------
  Key:  count     =  counts
        obs       =  number of observations
```
```
. svy, subpop(femV1): tab everV2, count format(%10.0f) obs miss // [T4] with miss too
(running tabulate on estimation sample)
```
```
Number of strata   =         1                 Number of obs      =        200
Number of PSUs     =       200                 Population size    =       4000
N. of poststrata   =         2                 Subpop. no. of obs =        110
                                               Subpop. size       =  2374.4374
                                               Design df          =        199

----------------------------------
   everV2 |      count         obs
----------+-----------------------
     0-No |        946          44
    1-Yes |       1168          50
        . |        260          16
          |
    Total |       2374         110
----------------------------------
  Key:  count     =  counts
        obs       =  number of observations
```

-----------------------------------------------------------------------------------------

--
Michael I. Lichter, Ph.D. <[email protected]>
Research Assistant Professor & NRSA Fellow
UB Department of Family Medicine / Primary Care Research Institute
UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
Office: CC 126 / Phone: 716-898-4751 / FAX: 716-898-3536