
From: Philip Burgess <philip.burgess.uq@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Postestimation puzzle(s)
Date: Sat, 28 Nov 2009 15:14:57 +1000

Jeff,

First of all, thank you for such an exhaustive analysis of my query. I think the problem is solved; I'll explain further below. I also want to thank you for stepping through the process with the detailed exposition and the example do-file code. I could easily replicate it, and stepping through it helped me understand the various issues; this was extremely helpful and informative.

I had at least two problems: (1) the population size of N = 1; and (2) the sampling-frame denominator.

I resolved the first issue by setting my st_wt variable to counts rather than proportions. It is worth noting, though, that nothing changed other than the reported Population size.

The second issue was resolved less directly. First, I had (correctly) developed stratification weights for age-sex based on the entire sample aged 20-69 years inclusive; I believe this is the right thing to do. Second, the example I provided was based on an analysis of a subset of the overall data, namely those who had a disorder and consulted a psychologist. I now believe that filtering the data in this way is not a true test of age-sex standardization. So, to verify your explanation, I re-ran the analysis without filtering the data and got the results you had hypothesized: (i) the proportions for 2007 did not change; (ii) the age-sex standardized proportions for 1997 did change; and (iii) the SEs changed for both 1997 and 2007 when I 'directly standardized'. In summary, all is good and makes entire sense.

I'm left with one issue: whether I should, or could, declare the age-sex stratification factors and weights as poststratification specifications in the svyset. Admittedly, I'm not across the conceptual issues here and may be naively going down the wrong path. I guess I can 'test' these effects for prevalence estimates, and I have two options for controlling for this in subsequent analyses:

1. declare the survey settings with poststratification factors (I'm not sure this is the right thing to do); and/or
2. include the age-sex strata in, say, any logistic regression models.

Not sure; but regardless, I have made significant progress thanks to your help. Thank you again for the time and effort you took to prepare such a considered analysis and exposition of the query I raised.

Many thanks,
Philip

On Thu, Nov 26, 2009 at 4:11 AM, Jeff Pitblado, StataCorp LP <jpitblado@stata.com> wrote:
> Philip Burgess <philip.burgess.uq@gmail.com> is working with data from two
> surveys, and wants to use direct standardization in order to make comparisons
> between the two.
>
> Philip's original posting (corrected version) is included following my
> signature.
>
> I'm posting my conclusion first so that it doesn't get lost in my exposition
> of Philip's data analysis experiment.
>
> Conclusion
> ----------
>
> Poststratification does not line up with Philip's expectations, but direct
> standardization does.
>
> It appears that Philip needs to verify that he generated the correct
> standardizing weights.
>
> Exposition
> ----------
>
> Starting with the auto dataset, I built a simulated dataset to illustrate
> Philip's data analysis experiment and verify his expectations.
>
> Here are the relevant variables:
>
>     pw      - survey sampling weight variable
>     foreign - indicator variable of interest
>     group   - survey group id: 0 old survey, 1 new survey
>
>     . set seed 12345
>     . sysuse auto
>     . keep if !missing(rep78) & rep > 2
>     . gen group = uniform() < .5
>     . gen pw = 2/uniform()
>
> Let's generate a standardizing weight variable for group == 1 (the new
> survey); we'll use 'rep78' to identify the standard strata.
>
>     . egen repwt = total(pw) if group == 1, by(rep78)
>
> Notice that I summed the sampling weights within the standard strata, and
> only within the observations for the new survey.
>
> Now I apply these values to the observations for the old survey.
>
>     . sort rep78 group
>     . by rep78: replace repwt = repwt[_N]
>
> Now we can replicate Philip's direct standardization experiment:
>
>     . svyset [pw=pw]
>     . svy: proportion for, over(group)
>     . svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
>
> The results from the above two estimation commands follow. Notice that in
> the results for group == 1, the estimated proportions are the same between
> these two commands; this is what Philip expected but didn't experience in
> his own experiment. The standard errors differ because direct
> standardization produces different score values from the non-standardized
> analysis; for more details see 'The standardized mean estimator' subsection
> of the 'Methods and formulas' section of '[R] mean' (proportions are means
> of indicator variables).
>
> ***** BEGIN:
> . svy: proportion for, over(group)
> (running proportion on estimation sample)
>
> Survey: Proportion estimation
>
> Number of strata =       1        Number of obs    =        59
> Number of PSUs   =      59        Population size  =   619.493
>                                   Design df        =        58
>
>   Domestic: foreign = Domestic
>    Foreign: foreign = Foreign
>
>          0: group = 0
>          1: group = 1
>
> --------------------------------------------------------------
>              |             Linearized
>         Over | Proportion   Std. Err.     [95% Conf. Interval]
> -------------+------------------------------------------------
> Domestic     |
>            0 |   .7560045   .1155163      .5247735    .9872354
>            1 |   .8139268   .0952956      .6231719    1.004682
> -------------+------------------------------------------------
> Foreign      |
>            0 |   .2439955   .1155163      .0127646    .4752265
>            1 |   .1860732   .0952956     -.0046817    .3768281
> --------------------------------------------------------------
>
> . svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
> (running proportion on estimation sample)
>
> Survey: Proportion estimation
>
> Number of strata  =       1       Number of obs    =        59
> Number of PSUs    =      59       Population size  =   619.493
> N. of std strata  =       3       Design df        =        58
>
>   Domestic: foreign = Domestic
>    Foreign: foreign = Foreign
>
>          0: group = 0
>          1: group = 1
>
> --------------------------------------------------------------
>              |             Linearized
>         Over | Proportion   Std. Err.     [95% Conf. Interval]
> -------------+------------------------------------------------
> Domestic     |
>            0 |   .8970116   .0396547      .8176342    .9763891
>            1 |   .8139268   .0459222      .7220036      .90585
> -------------+------------------------------------------------
> Foreign      |
>            0 |   .1029884   .0396547      .0236109    .1823658
>            1 |   .1860732   .0459222        .09415    .2779964
> --------------------------------------------------------------
> ***** END:
>
> Philip originally used poststratification in his experiment. For my
> experiment, this amounts to the following analysis:
>
>     . svyset [pw=pw], poststrata(rep78) postweight(repwt)
>     . svy: proportion for, over(group)
>
> Philip noticed that the estimated population size was different from the
> original non-poststratified results. He reported that the population size
> was estimated to be 1. I must admit that I initially assumed Philip meant 1
> million, since his original analysis estimated a pop size of 1.39 million.
>
> For my experiment, the estimated population size is going to be the sum of
> the raw sampling weights (pw) for group == 1, since this is how we generated
> the standard weights.
>
> Poststratification is a weight adjustment procedure. As the following
> results illustrate, this is very different from direct standardization.
>
> ***** BEGIN:
> . sum pw if group, mean
>
> . di r(sum)
> 361.081
>
> . svy: proportion for, over(group)
> (running proportion on estimation sample)
>
> Survey: Proportion estimation
>
> Number of strata  =       1       Number of obs    =        59
> Number of PSUs    =      59       Population size  =   361.081
> N. of poststrata  =       3       Design df        =        58
>
>   Domestic: foreign = Domestic
>    Foreign: foreign = Foreign
>
>          0: group = 0
>          1: group = 1
>
> --------------------------------------------------------------
>              |             Linearized
>         Over | Proportion   Std. Err.     [95% Conf. Interval]
> -------------+------------------------------------------------
> Domestic     |
>            0 |   .8497911   .0623371      .7250098    .9745724
>            1 |   .8608167   .0603897      .7399335       .9817
> -------------+------------------------------------------------
> Foreign      |
>            0 |   .1502089   .0623371      .0254276    .2749902
>            1 |   .1391833   .0603897         .0183    .2600665
> --------------------------------------------------------------
> ***** END:
>
> Based on this, and thinking about Philip's estimated population size of 1, I
> believe that Philip may have generated standardizing relative frequencies
> instead of standardizing counts. While Stata works properly with either
> counts or relative frequencies in the standard weight variable, Stata's
> poststratification features assume counts.
>
> Here are the contents of the do-file I used to produce the above results:
>
> ***** BEGIN:
> * Simulated data:
> *   pw      - survey sampling weight variable
> *   foreign - indicator variable of interest
> *   group   - survey group id: 0 old survey, 1 new survey
>
> set seed 12345
> sysuse auto
> keep if !missing(rep78) & rep > 2
> gen group = uniform() < .5
> gen pw = 2/uniform()
>
> * Generate a standardizing weight variable for group == 1 (new survey):
> *   rep78 - standard strata
> egen repwt = total(pw) if group, by(rep78)
>
> * Apply the standard weight values to the old survey observations:
> sort rep78 group
> by rep78: replace repwt = repwt[_N]
>
> * Replicate Philip's experiment:
> svyset [pw=pw]
> svy: proportion for, over(group)
> svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
>
> * The estimated population size is going to be the sum of the raw sampling
> * weights (pw) for group == 1:
> svyset [pw=pw], poststrata(rep78) postweight(repwt)
> sum pw if group, mean
> di r(sum)
> svy: proportion for, over(group)
> ***** END:
>
> --Jeff
> jpitblado@stata.com
>
>> I have two surveys, 1997 and 2007, with complex survey designs,
>> available for analysis with jackknife replicate weights. The surveys
>> were more or less equivalent in their design: nationally
>> representative random (independent) samples (without replacement).
>>
>> A possible issue is that the 1997 survey included persons aged 18
>> years or older (no upper limit), while the 2007 survey included
>> persons aged 16-85 years inclusive. As such, there could be issues
>> with combining the two surveys, given that the replicate weights are
>> calibrated to different population structures; I'm not sure there is
>> any way around this, if this is the explanation.
>>
>> I want to age-sex standardize the 1997 data to the 2007 population
>> structure. To do so, I first limited the two samples to respondents
>> aged 20-69 years inclusive, to get like-with-like comparisons. I then
>> created a new variable indicating the age-sex strata (10 five-year
>> age bands x 2 sexes = 20 strata; variable name st_agesex). I then
>> estimated the 2007 population size for each of these 20 age-sex
>> strata (variable name st_wt).
>>
>> I make several runs through the data.
>>
>> The first specifies the complex survey design:
>>
>>     . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1)) ///
>>               vce(jackknife) mse
>>
>> Stata output reports: Number of strata = 1; Population size = 1.39
>> million. All this makes sense.
>>
>> I then estimate the proportions who consulted a psychologist for
>> mental health problems in the last 12 months (mhpsyco12: coded 0/1)
>> over the two surveys (nsmhwb: 0 = 1997; 1 = 2007):
>>
>>     . svy jackknife, nodots : proportion mhpsyco12, over(nsmhwb)
>>
>> These give estimates for the unadjusted populations:
>>
>>     1997 - 17.0% (SE 1.3%)
>>     2007 - 37.3% (SE 2.5%)
>>
>> All good so far.
>>
>> The second pass through the data declares the complex survey design
>> with poststratification strata and weights:
>>
>>     . quietly svyset [pweight=mhsfinwt], poststrata(st_agesex) ///
>>               postweight(st_wt) jkrweight(wpm*, multiplier(1)) ///
>>               vce(jackknife) mse
>>
>> Stata output reports Number of strata = 1 and N. of poststrata = 20,
>> but it also reports a Population size of 1. I don't understand the
>> Population size figure: why isn't it 1.39 million, as above?
>>
>> I then estimate the proportions who consulted a psychologist for
>> mental health problems in the last 12 months, adjusted for the
>> age-sex stratum factors:
>>
>>     . svy jackknife, nodots : proportion mhpsyco12, over(nsmhwb)
>>
>> These give estimates for the 'adjusted' age-sex standardized
>> populations:
>>
>>     1997 - 15.7% (SE 1.4%)
>>     2007 - 37.1% (SE 2.6%)
>>
>> I expected the 1997 estimate to be reduced given age-sex adjustment,
>> and this is the case. But I do not understand why the 2007 'adjusted'
>> estimates differ at all from the 2007 'unadjusted' estimates.
>>
>> Finally, to try to unravel this, I reverted to the original complex
>> survey design declaration:
>>
>>     . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1)) ///
>>               vce(jackknife) mse
>>
>> Stata output reports Number of strata = 1 and, for the standardized
>> run, N. of std strata = 20; both of these make sense. Stata also
>> reports a Population size of 1.39 million, which also makes sense.
>>
>> I then tried to 'directly standardize' the proportions:
>>
>>     . svy jackknife, nodots : proportion mhpsyco12, stdize(st_agesex) ///
>>               stdweight(st_wt) over(nsmhwb)
>>
>> These give estimates for the 'adjusted' age-sex standardized
>> populations:
>>
>>     1997 - 15.6% (SE 1.4%)
>>     2007 - 37.8% (SE 3.0%)
>>
>> So, I'm confused. I understand why the 1997 estimates vary given
>> age-sex adjustment (although I'm a bit confused as to why the results
>> differ between poststratification and direct standardization); I have
>> more trouble understanding the varying estimates for 2007.
>>
>> I'm struggling to understand all of this and welcome any ideas! It's
>> likely I do not properly understand the poststratification process.
>>
>> I'm using Stata 11.0, born 21 October 2009.
>>
>> Any thoughts or ideas gratefully received!
> *
> * For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
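For readers following the thread outside Stata, the weight arithmetic behind Jeff's two procedures can be sketched in a few lines. The following is a minimal numeric sketch with made-up toy data, not Jeff's auto-based simulation; the helper names (wprop, direct_std, poststratify) are hypothetical and only illustrate the formulas, under the assumption that the standard weights are the new survey's within-stratum weight totals.

```python
# Toy sketch of direct standardization vs poststratification of a proportion.
# All data and helper names are made up for illustration; not Stata output.

def wprop(recs):
    """Weighted proportion sum(w*y)/sum(w) over (stratum, y, w) records."""
    return sum(w * y for _, y, w in recs) / sum(w for _, _, w in recs)

strata = ["a", "b"]

# (stratum, y, sampling weight); 'new' plays group == 1, 'old' group == 0.
new = [("a", 1, 10), ("a", 0, 10), ("b", 1, 5), ("b", 0, 15)]
old = [("a", 1, 8), ("a", 0, 2), ("b", 0, 20), ("b", 0, 10)]
pooled = new + old

# Standard weights: the new survey's weight total per stratum
# (counts, as Stata's poststratification assumes; not relative frequencies).
stdw = {s: sum(w for st, _, w in new if st == s) for s in strata}

def direct_std(recs):
    """Direct standardization: stratum proportions averaged by stdw."""
    tot = sum(stdw.values())
    return sum(stdw[s] / tot * wprop([r for r in recs if r[0] == s])
               for s in strata)

def poststratify(recs):
    """Poststratification: rescale each weight so the POOLED sample's
    stratum weight totals match stdw; recs keep their rescaled weights."""
    pooled_tot = {s: sum(w for st, _, w in pooled if st == s) for s in strata}
    return [(st, y, w * stdw[st] / pooled_tot[st]) for st, y, w in recs]

p_unadj = wprop(new)                # unadjusted new-survey proportion
p_std   = direct_std(new)           # directly standardized: unchanged
p_post  = wprop(poststratify(new))  # poststratified: it moves
popsize = sum(w for _, _, w in poststratify(pooled))  # == sum(stdw.values())

print(p_unadj, p_std, p_post, popsize)
```

On these toy numbers the directly standardized proportion for the new survey equals its unadjusted value, because the standard weights are that survey's own stratum weight totals; the poststratified proportion moves, because the adjustment factors pool both surveys within each poststratum; and the poststratified population size collapses to the sum of the standard weights, which is the behaviour Jeff describes.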

**Follow-Ups**: **Re: st: Postestimation puzzle(s)**, *From:* Philip Burgess <philip.burgess.uq@gmail.com>

**References**: **Re: st: Postestimation puzzle(s)**, *From:* jpitblado@stata.com (Jeff Pitblado, StataCorp LP)


© Copyright 1996–2017 StataCorp LLC