[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Postestimation puzzle(s)

From	[email protected] (Jeff Pitblado, StataCorp LP)
To	[email protected]
Subject	Re: st: Postestimation puzzle(s)
Date	Wed, 25 Nov 2009 12:11:35 -0600
Philip Burgess <[email protected]> is working with data from two
survey, and wants to use direct standardization in order to make comparisons
between the two.

Philip's original posting (corrected version) is included following my
signature.

I'm posting my conclusion first so that it doesn't get lost in my exposition
of Philip's data analysis experiment.

Conclusion
----------

Poststratification does not line up with Philip's expectations, but direct
standardization does.

It appears that Philip needs to verify that he generated the correct
standardizing weights.

Exposition
----------

Starting with the auto dataset, I built a simulated dataset to illustrate
Philip's data analysis experiment and verify his expectations.

Here are the relevant variables:

pw            - survey sampling weight variable
foreign       - indicator variable of interest
group         - survey group id: 0 old survey, 1 new survey

	. set seed 12345
	. sysuse auto
	. keep if !missing(rep78) & rep > 2
	. gen group = uniform() < .5
	. gen pw = 2/uniform()

Let's generate a standardizing weight variable for group == 1 (new survey),
we'll use 'rep78' to identify the standard strata.

	. egen repwt = total(pw) if group == 1, by(rep78)

Notice that I summed the sampling weights within the standard strata, and only
within the observations for the new survey.

Now I apply these values to the observations for the old survey.

	. sort rep78 group
	. by rep78: replace repwt = repwt[_N]

Now we can replicate Philip's direct standardization experiment:

	. svyset [pw=pw] 
	. svy: proportion for, over(group)
	. svy: proportion for, over(group) stdize(rep78) stdweight(repwt)

The results from the above two estimation commands follow.  Notice that in the
results for group == 1, the estimated proportions are the same between these
two commands; this is as Philip expected but didn't experience with his own
experiment.  The standard errors differ due to the fact that direct
standardization produces different score values from the non-standardized
analysis, for more details see 'The standardized mean estimator' subsection of
the 'Methods and formulas' section of '[R] mean' (proportions of means of
indicator variables).

***** BEGIN:
. svy: proportion for, over(group)
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =       1          Number of obs    =      59
Number of PSUs   =      59          Population size  = 619.493
                                    Design df        =      58

     Domestic: foreign = Domestic
      Foreign: foreign = Foreign

            0: group = 0
            1: group = 1

--------------------------------------------------------------
             |             Linearized
        Over | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
Domestic     |
           0 |   .7560045   .1155163      .5247735    .9872354
           1 |   .8139268   .0952956      .6231719    1.004682
-------------+------------------------------------------------
Foreign      |
           0 |   .2439955   .1155163      .0127646    .4752265
           1 |   .1860732   .0952956     -.0046817    .3768281
--------------------------------------------------------------

. svy: proportion for, over(group) stdize(rep78) stdweight(repwt)
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =       1          Number of obs    =      59
Number of PSUs   =      59          Population size  = 619.493
N. of std strata =       3          Design df        =      58

     Domestic: foreign = Domestic
      Foreign: foreign = Foreign

            0: group = 0
            1: group = 1

--------------------------------------------------------------
             |             Linearized
        Over | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
Domestic     |
           0 |   .8970116   .0396547      .8176342    .9763891
           1 |   .8139268   .0459222      .7220036      .90585
-------------+------------------------------------------------
Foreign      |
           0 |   .1029884   .0396547      .0236109    .1823658
           1 |   .1860732   .0459222        .09415    .2779964
--------------------------------------------------------------
***** END:

Philip originally used poststratification in his experiment.  For my
experiment, this amounts to the following analysis:

	. svyset [pw=pw], poststrata(rep78) postweight(repwt)
	. svy: proportion for, over(group)

Philip noticed that the estimated population size was different from the
original non-poststratified results.  He reported that the population size was
estimated to be 1.  I must admit that I initially assumed Philip meant 1
million since his original analysis estimated a pop size of 1.39 million.

For my experiment, the estimated population size is going to be the sum of the
raw sampling weights (pw) for group == 1, since this is how we generated the
standard weights.

Poststratification is a weight adjustment procedure.  As the following results
illustrate, this is very different from direct standardization.

***** BEGIN:
. sum pw if group, mean

. di r(sum)
361.081

. svy: proportion for, over(group)
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =       1          Number of obs    =      59
Number of PSUs   =      59          Population size  = 361.081
N. of poststrata =       3          Design df        =      58

     Domestic: foreign = Domestic
      Foreign: foreign = Foreign

            0: group = 0
            1: group = 1

--------------------------------------------------------------
             |             Linearized
        Over | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
Domestic     |
           0 |   .8497911   .0623371      .7250098    .9745724
           1 |   .8608167   .0603897      .7399335       .9817
-------------+------------------------------------------------
Foreign      |
           0 |   .1502089   .0623371      .0254276    .2749902
           1 |   .1391833   .0603897         .0183    .2600665
--------------------------------------------------------------
***** END:

Based on this, and thinking about Philip's estimated population size of 1, I
believe that Philip may have generated the standardizing relative frequencies
instead of standardizing count. While Stata works properly with counts or
relative frequencies in the standard weight variable, Stata's
poststratification features assume counts.

Here is the contents of the do-file I used to produce the above results:

***** BEGIN:
* Simulated data:
* pw		- survey sampling weight variable
* foreign	- indicator variable of interest
* group		- survey group id: 0 old survey, 1 new survey

set seed 12345
sysuse auto
keep if !missing(rep78) & rep > 2
gen group = uniform() < .5
gen pw = 2/uniform()

* Let's generate a standardizing weight variable for group == 1 (new survey):
* rep78		- standard strata
egen repwt = total(pw) if group, by(rep78)

* Now apply the standard weight values to the old survey observations:
sort rep78 group
by rep78: replace repwt = repwt[_N]

* Now we can replicate Philip's experiment:
svyset [pw=pw]
svy: proportion for, over(group)
svy: proportion for, over(group) stdize(rep78) stdweight(repwt)

* The estimated population size is going to be the sum of the raw sampling
* weights (pw) for group == 1:
svyset [pw=pw], poststrata(rep78) postweight(repwt)
sum pw if group, mean
di r(sum)
svy: proportion for, over(group)

***** END:

--Jeff
[email protected]

> I have 2 surveys, 1997 & 2007, with complex survey designs and
> available for analysis with Jackknife replicate weights. The surveys
> were more or less equivalent in their design: nationally
> representative random (independent) samples (without replacement).
> 
> A possible issue is that the 1997 survey included persons aged 18
> years or older (no upper limit); the 2007 survey included persons aged
> 16 â?? 85 years inclusive. As such, there could be issues about
> combining the two surveys given that the replicate weights are
> calibrated to different population structures â?? not sure there is any
> way around this if this is an explanation.
> 
> I want to age-sex standardize the 1997 data to the 2007 population
> structure. To do so, I first limited the two samples to respondents
> aged 20-69 years inclusive â?? to get like-with-like comparisons. I then
> created a new variable indicating the age-sex strata (10 5-year age
> bands x 2 sex = 20 strata â?? variable name, st_agesex). I then
> estimated the 2007 population size for each of these 20 age-sex strata
> â?? variable name, st_wt).
> 
> I do several runs through the data.
> 
> The first specifies the complex survey design:
> 
> . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1))
> vce(jackknife) mse
> 
> Stata output reports: Number of strata = 1; Population size = 1.39
> million. All this makes sense.
> 
> I then estimate proportions who consulted a psychologist for mental
> health problems in the last 12-months (mhpsyco12: code 0/1) over the
> two surveys (nsmhwb, 0 = 1997; 1 = 2007)
> 
> . svy jackknife, nodots : proportion mhpsyo12, over(nsmhwb)
> 
> These give estimates for the unadjusted populations:
> 
> 1997 - 17.0% (SE 1.3%);
> 2007 â?? 37.3% (SE 2.5%).
> 
> All good so far.
> 
> The second pass through the data declares the complex survey design
> with poststratification specification strata and weights:
> 
> . quietly svyset [pweight=mhsfinwt], poststrata(st_agesex) postweight(st_wt) ///
> ï?? jkrweight(wpm*, multiplier(1)) vce(jackknife) mse
> 
> Stata output reports Number of strata = 1; N. of poststrata = 20, and
> Stata also reports a Population size of 1. I donâ??t understand the
> Population size parameter â?? why isnâ??t it 1.39 mill per above?
> 
> I then estimate proportions who consulted a psychologist for mental
> health problems in the last 12-months adjusted for the age-sex stratum
> factors
> 
> . svy jackknife, nodots : proportion mhpsyo12, over(nsmhwb)
> 
> These give estimates for the â??adjustedâ?? age-sex standardized populations:
> 
> 1997 - 15.7% (SE 1.4%);
> 2007 â?? 37.1% (SE 2.6%).
> 
> I expected the 1997 estimate to be reduced given age-sex adjustment â??
> this is the case. But I do not understand why the 2007 â??adjustedâ??
> estimates vary at all from the 2007 â??unadjustedâ?? unadjusted estimates.
> 
> Finally, to try and unravel this matter, I resorted to the original
> complex survey design declaration:
> 
> . quietly svyset [pweight=mhsfinwt], jkrweight(wpm*, multiplier(1))
> vce(jackknife) mse
> 
> Stata output reports Number of strata = 1; N. of std strata = 20 â??
> both of these make sense. Stata also reports a Population size of 1.39
> million. All of these make sense.
> 
> I then tried to â??directly standardizeâ?? the proportions:
> 
> . svy jackknife, nodots : proportion mhpsyo12, stdize(st_agesex)
> stdweight(st_wt) over(nsmhwb)
> 
> These give estimates for the â??adjustedâ?? age-sex standardized populations:
> 
> 1997 - 15.6% (SE 1.4%);
> 2007 â?? 37.8% (SE 3.0%).
> 
> So, Iâ??m confused. I understand why the 1997 estimates vary given
> age-sex adjustment (although a bit confused why the results differ
> between poststratification and direct standardization); I have more
> trouble understanding the varying estimates for 2007.
> 
> Iâ??m struggling to understand all of this and welcome any ideas! Itâ??s
> likely I do not properly understand the postsratification processes.
> 
> Iâ??m using Stata 11.0, born 21 October 2009.
> 
> Any thoughts or ideas most grateful!
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Follow-Ups:
- Re: st: Postestimation puzzle(s)
  - From: Philip Burgess <[email protected]>
Prev by Date: st: Retrieving statistics after pstest to use with estout
Next by Date: st: -tabplot- updated from SSC
Previous by thread: Re: st: Postestimation puzzle(s)
Next by thread: Re: st: Postestimation puzzle(s)
Index(es):
- Date
- Thread