[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
[email protected] (Jeff Pitblado, StataCorp LP) |

To |
[email protected] |

Subject |
Re: st: Postestimation puzzle(s) |

Date |
Wed, 25 Nov 2009 12:11:35 -0600 |

Philip Burgess <[email protected]> is working with data from two survey, and wants to use direct standardization in order to make comparisons between the two. Philip's original posting (corrected version) is included following my signature. I'm posting my conclusion first so that it doesn't get lost in my exposition of Philip's data analysis experiment. Conclusion ---------- Poststratification does not line up with Philip's expectations, but direct standardization does. It appears that Philip needs to verify that he generated the correct standardizing weights. Exposition ---------- Starting with the auto dataset, I built a simulated dataset to illustrate Philip's data analysis experiment and verify his expectations. Here are the relevant variables: pw - survey sampling weight variable foreign - indicator variable of interest group - survey group id: 0 old survey, 1 new survey . set seed 12345 . sysuse auto . keep if !missing(rep78) & rep > 2 . gen group = uniform() < .5 . gen pw = 2/uniform() Let's generate a standardizing weight variable for group == 1 (new survey), we'll use 'rep78' to identify the standard strata. . egen repwt = total(pw) if group == 1, by(rep78) Notice that I summed the sampling weights within the standard strata, and only within the observations for the new survey. Now I apply these values to the observations for the old survey. . sort rep78 group . by rep78: replace repwt = repwt[_N] Now we can replicate Philip's direct standardization experiment: . svyset [pw=pw] . svy: proportion for, over(group) . svy: proportion for, over(group) stdize(rep78) stdweight(repwt) The results from the above two estimation commands follow. Notice that in the results for group == 1, the estimated proportions are the same between these two commands; this is as Philip expected but didn't experience with his own experiment. The standard errors differ due to the fact that direct standardization produces different score values from the non-standardized analysis, for more details see 'The standardized mean estimator' subsection of the 'Methods and formulas' section of '[R] mean' (proportions of means of indicator variables). ***** BEGIN: . svy: proportion for, over(group) (running proportion on estimation sample) Survey: Proportion estimation Number of strata = 1 Number of obs = 59 Number of PSUs = 59 Population size = 619.493 Design df = 58 Domestic: foreign = Domestic Foreign: foreign = Foreign 0: group = 0 1: group = 1 -------------------------------------------------------------- | Linearized Over | Proportion Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ Domestic | 0 | .7560045 .1155163 .5247735 .9872354 1 | .8139268 .0952956 .6231719 1.004682 -------------+------------------------------------------------ Foreign | 0 | .2439955 .1155163 .0127646 .4752265 1 | .1860732 .0952956 -.0046817 .3768281 -------------------------------------------------------------- . svy: proportion for, over(group) stdize(rep78) stdweight(repwt) (running proportion on estimation sample) Survey: Proportion estimation Number of strata = 1 Number of obs = 59 Number of PSUs = 59 Population size = 619.493 N. of std strata = 3 Design df = 58 Domestic: foreign = Domestic Foreign: foreign = Foreign 0: group = 0 1: group = 1 -------------------------------------------------------------- | Linearized Over | Proportion Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ Domestic | 0 | .8970116 .0396547 .8176342 .9763891 1 | .8139268 .0459222 .7220036 .90585 -------------+------------------------------------------------ Foreign | 0 | .1029884 .0396547 .0236109 .1823658 1 | .1860732 .0459222 .09415 .2779964 -------------------------------------------------------------- ***** END: Philip originally used poststratification in his experiment. For my experiment, this amounts to the following analysis: . svyset [pw=pw], poststrata(rep78) postweight(repwt) . svy: proportion for, over(group) Philip noticed that the estimated population size was different from the original non-poststratified results. He reported that the population size was estimated to be 1. I must admit that I initially assumed Philip meant 1 million since his original analysis estimated a pop size of 1.39 million. For my experiment, the estimated population size is going to be the sum of the raw sampling weights (pw) for group == 1, since this is how we generated the standard weights. Poststratification is a weight adjustment procedure. As the following results illustrate, this is very different from direct standardization. ***** BEGIN: . sum pw if group, mean . di r(sum) 361.081 . svy: proportion for, over(group) (running proportion on estimation sample) Survey: Proportion estimation Number of strata = 1 Number of obs = 59 Number of PSUs = 59 Population size = 361.081 N. of poststrata = 3 Design df = 58 Domestic: foreign = Domestic Foreign: foreign = Foreign 0: group = 0 1: group = 1 -------------------------------------------------------------- | Linearized Over | Proportion Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ Domestic | 0 | .8497911 .0623371 .7250098 .9745724 1 | .8608167 .0603897 .7399335 .9817 -------------+------------------------------------------------ Foreign | 0 | .1502089 .0623371 .0254276 .2749902 1 | .1391833 .0603897 .0183 .2600665 -------------------------------------------------------------- ***** END: Based on this, and thinking about Philip's estimated population size of 1, I believe that Philip may have generated the standardizing relative frequencies instead of standardizing count. While Stata works properly with counts or relative frequencies in the standard weight variable, Stata's poststratification features assume counts. Here is the contents of the do-file I used to produce the above results: ***** BEGIN: * Simulated data: * pw - survey sampling weight variable * foreign - indicator variable of interest * group - survey group id: 0 old survey, 1 new survey set seed 12345 sysuse auto keep if !missing(rep78) & rep > 2 gen group = uniform() < .5 gen pw = 2/uniform() * Let's generate a standardizing weight variable for group == 1 (new survey): * rep78 - standard strata egen repwt = total(pw) if group, by(rep78) * Now apply the standard weight values to the old survey observations: sort rep78 group by rep78: replace repwt = repwt[_N] * Now we can replicate Philip's experiment: svyset [pw=pw] svy: proportion for, over(group) svy: proportion for, over(group) stdize(rep78) stdweight(repwt) * The estimated population size is going to be the sum of the raw sampling * weights (pw) for group == 1: svyset [pw=pw], poststrata(rep78) postweight(repwt) sum pw if group, mean di r(sum) svy: proportion for, over(group) ***** END: --Jeff [email protected] > I have 2 surveys, 1997 & 2007, with complex survey designs and > available for analysis with Jackknife replicate weights. The surveys > were more or less equivalent in their design: nationally > representative random (independent) samples (without replacement). > > A possible issue is that the 1997 survey included persons aged 18 > years or older (no upper limit); the 2007 survey included persons aged > 16 â?? 85 years inclusive. As such, there could be issues about > combining the two surveys given that the replicate weights are > calibrated to different population structures â?? not sure there is any > way around this if this is an explanation. > > I want to age-sex standardize the 1997 data to the 2007 population > structure. To do so, I first limited the two samples to respondents > aged 20-69 years inclusive â?? to get like-with-like comparisons. I then > created a new variable indicating the age-sex strata (10 5-year age > bands x 2 sex = 20 strata â?? variable name, st_agesex). I then > estimated the 2007 population size for each of these 20 age-sex strata > â?? variable name, st_wt). > > I do several runs through the data. > > The first specifies the complex survey design: > > . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1)) > vce(jackknife) mse > > Stata output reports: Number of strata = 1; Population size = 1.39 > million. All this makes sense. > > I then estimate proportions who consulted a psychologist for mental > health problems in the last 12-months (mhpsyco12: code 0/1) over the > two surveys (nsmhwb, 0 = 1997; 1 = 2007) > > . svy jackknife, nodots : proportion mhpsyo12, over(nsmhwb) > > These give estimates for the unadjusted populations: > > 1997 - 17.0% (SE 1.3%); > 2007 â?? 37.3% (SE 2.5%). > > All good so far. > > The second pass through the data declares the complex survey design > with poststratification specification strata and weights: > > . quietly svyset [pweight=mhsfinwt], poststrata(st_agesex) postweight(st_wt) /// > ï?? jkrweight(wpm*, multiplier(1)) vce(jackknife) mse > > Stata output reports Number of strata = 1; N. of poststrata = 20, and > Stata also reports a Population size of 1. I donâ??t understand the > Population size parameter â?? why isnâ??t it 1.39 mill per above? > > I then estimate proportions who consulted a psychologist for mental > health problems in the last 12-months adjusted for the age-sex stratum > factors > > . svy jackknife, nodots : proportion mhpsyo12, over(nsmhwb) > > These give estimates for the â??adjustedâ?? age-sex standardized populations: > > 1997 - 15.7% (SE 1.4%); > 2007 â?? 37.1% (SE 2.6%). > > I expected the 1997 estimate to be reduced given age-sex adjustment â?? > this is the case. But I do not understand why the 2007 â??adjustedâ?? > estimates vary at all from the 2007 â??unadjustedâ?? unadjusted estimates. > > Finally, to try and unravel this matter, I resorted to the original > complex survey design declaration: > > . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1)) > vce(jackknife) mse > > Stata output reports Number of strata = 1; N. of std strata = 20 â?? > both of these make sense. Stata also reports a Population size of 1.39 > million. All of these make sense. > > I then tried to â??directly standardizeâ?? the proportions: > > . svy jackknife, nodots : proportion mhpsyo12, stdize(st_agesex) > stdweight(st_wt) over(nsmhwb) > > These give estimates for the â??adjustedâ?? age-sex standardized populations: > > 1997 - 15.6% (SE 1.4%); > 2007 â?? 37.8% (SE 3.0%). > > So, Iâ??m confused. I understand why the 1997 estimates vary given > age-sex adjustment (although a bit confused why the results differ > between poststratification and direct standardization); I have more > trouble understanding the varying estimates for 2007. > > Iâ??m struggling to understand all of this and welcome any ideas! Itâ??s > likely I do not properly understand the postsratification processes. > > Iâ??m using Stata 11.0, born 21 October 2009. > > Any thoughts or ideas most grateful! * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Postestimation puzzle(s)***From:*Philip Burgess <[email protected]>

- Prev by Date:
**st: Retrieving statistics after pstest to use with estout** - Next by Date:
**st: -tabplot- updated from SSC** - Previous by thread:
**Re: st: Postestimation puzzle(s)** - Next by thread:
**Re: st: Postestimation puzzle(s)** - Index(es):

© Copyright 1996–2024 StataCorp LLC | Terms of use | Privacy | Contact us | What's new | Site index |