# Re: st: Survey - raking - calibration - post stratification - calculating weights

From: "Kristian Wraae"
Subject: SV: SV: SV: S: SV: st: Survey - raking - calibration - post stratification - calculating weights
Date: Mon, 8 Dec 2008 22:40:02 +0100

```
Thanks Steven

That was what I was missing. I thought I was going crazy :-)

So I'll just recap it:

Age_grp	n_age_grp	pct_age_grp
1		450		9.05
2		438		8.80
3		395		7.94
4		375		7.54
5		376		7.56
6		370		7.44
7		344		6.91
8		315		6.33
9		306		6.15
10		299		6.01
11		275		5.53
12		271		5.45
13		263		5.29
14		241		4.84
15		257		5.17
Total		4975

Age_grp	n_age_grp_q	pct_age_grp_q
1		346		9.24
2		333		8.9
3		304		8.12
4		297		7.93
5		284		7.59
6		275		7.35
7		249		6.65
8		246		6.57
9		231		6.17
10		209		5.58
11		212		5.66
12		210		5.61
13		184		4.92
14		174		4.65
15		189		5.05
Total		3743

So I generate weight1x = (n_age_grp / n_age_grp_q):

gen weight1x = .
replace weight1x = 450 / 346 if age_grp == 1
.
.
.
replace weight1x = 257 / 189 if age_grp == 15

I generate a variable called quest which is 1 for the 3743 and 0 for the
rest.

I use the totals for tot_age_grp = round(pct_age_grp*4975)  and
tot_geo_grp=round(pct_geo_grp*4975)

Now I rake:
* reduce the dataset to the 3743 men
keep if quest==1
survwgt rake  weight1x,   ///
by(age_grp  geo_grp) ///
totvars(tot_age_grp tot_geo_grp) ///
gen(weight2x)

svyset  [pweight=weight2x], strata(age_grp)

Mean of bmi in the population of 4975 will then be estimated as:

svymean bmi

Now I estimate the probability of inclusion in the group of the 600 men:

I generate the variable called sample which is 1 for each of the
600 and 0 for the rest of the 3743.

.tab sample

sample	Freq.	Percent	Cum.

0		3,143	83.97		83.97
1		600	16.03		100.00

I now generate the probability of inclusion using as many variables as
possible? I have more than 200...

xi: logistic sample i.age_grp i.geo_grp v1 v2 v3 ..... v200

predict p_r

gen weight3x = weight2x * (1/p_r)

Now I rake the 600 men back to the age and geographic categories of the 5000
men (using the same totals as earlier):

* reduce the dataset to the 600 men
keep if sample == 1
survwgt rake  weight3x,   ///
by(age_grp  geo_grp) ///
totvars(tot_age_grp tot_geo_grp) ///
gen(weight5x)

And to finally rake the 600 men to the true background population I use

gen tot_back_ground_age_grp = xxx1 if age_grp==1
.
.

And

gen tot_back_ground_geo_grp = yyy1 if geo_grp==1
.
.

Then I rake:

survwgt rake  weight5x,   ///
by(age_grp  geo_grp) ///
totvars(tot_back_ground_age_grp tot_back_ground_geo_grp) ///
gen(weight6x)

svyset  [pweight=weight6x], strata(age_grp)

Then the prevalence of hypogonadism (binary variable) in the background
population is:

Sorry for going through all these steps over and over again. But I really
need to know that I'm understanding it correctly.

Thanks
Kristian

-----Original message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Steven Samuels
Sent: Monday, December 08, 2008 9:31 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: SV: SV: S: SV: st: Survey - raking - calibration - post
stratification - calculating weights

Kristian:

I was vague and I apologize: I mixed up the initial weights.

Step 1. Compute weight1x = N/n but now n is the number of the 3,750
in each age group.  You will never use the original weight that Stas
and I suggested for the 600.

Step 2. Rake  age and geographical area to get new weights weight2x
for the 3,750 people.  To get estimates of population characteristics
from the mail questionnaire, in the data set of 3,750,

Type: "svyset _n [pweight=weight1x], strata(age_grp)" as before.

Then you can use -svymean- or -svytab- (if available in Stata 8) to
describe the population.

Step 3. For the logistic regression, you should use as many predictors
from the mail questionnaire data as possible to predict who participated
in the telephone survey, not just age and geography.

Steps 4 and beyond are now confined to the 600 men in the last phase
sample.  You completely ignore from this point on the 3,750 man phase.

Step 5 is correct, as long as you add more covariates to your
logistic regression and check the model fit.

Step 6 is okay.   You don't need to first compute the percentages,
just the original totals in the age groups of the 5,000 men.  You do
not need the 10,000 person trick, because you are only matching to a
single external population in this step.

Step 7.  In Step 7, you can just provide the age group totals for the
census groups, not the percentages.

Thank you for your kind offer of acknowledgement. I think it
appropriate to refer to help that you received on Statalist, but I do
not wish to be acknowledged individually.

-Steve

On Dec 8, 2008, at 2:10 PM, Kristian Wraae wrote:

> Ok. I'm a bit lost here. I really don't understand all the steps
> (especially
> step 2) but I'll try to do them anyway.
>
>
> *1:
>
> Like before:
> The 600:
> age_grp	n_age_grp_s	pct_age_grp_s
> 1		38		6.33
> 2		47		7.83
> 3		41		6.83
> 4		41		6.83
> 5		44		7.33
> 6		38		6.33
> 7		44		7.33
> 8		48		8.00
> 9		43		7.17
> 10		41		6.83
> 11		42		7.00
> 12		35		5.83
> 13		39		6.50
> 14		33		5.50
> 15		26		4.33
> Total 	600
>
> And the 4975:
>  age_grp	n_age_grp	pct_age_grp	Cum.
>  1		450		9.05		9.05
>  2		438		8.80		17.85
>  3		395		7.94		25.79
>  4		375		7.54		33.33
>  5		376		7.56		40.88
>  6		370		7.44		48.32
>  7		344		6.91		55.24
>  8		315		6.33		61.57
>  9		306		6.15		67.72
>  10		299		6.01		73.73
>  11		275		5.53		79.26
>  12		271		5.45		84.70
>  13		263		5.29		89.99
>  14		241		4.84		94.83
>  15		257		5.17		100.00
>  Total		4975
>
> So weight1 is defined as:
>
> gen weight1=.
> replace weight1 = 450 / 38 if age_grp == 1
> replace weight1 = 438 / 47 if age_grp == 2
> replace weight1 = 395 / 41 if age_grp == 3
> replace weight1 = 375 / 41 if age_grp == 4
> replace weight1 = 376 / 44 if age_grp == 5
> replace weight1 = 370 / 38 if age_grp == 6
> replace weight1 = 344 / 44 if age_grp == 7
> replace weight1 = 315 / 48 if age_grp == 8
> replace weight1 = 306 / 43 if age_grp == 9
> replace weight1 = 299 / 41 if age_grp == 10
> replace weight1 = 275 / 42 if age_grp == 11
> replace weight1 = 271 / 35 if age_grp == 12
> replace weight1 = 263 / 39 if age_grp == 13
> replace weight1 = 241 / 33 if age_grp == 14
> replace weight1 = 257 / 26 if age_grp == 15
>
> *2:
> ?????
> How do I estimate
>
> *3:
>
> *4:
>
> Now I generate a variable called sample which is 1 for each of the
> 600 and 0
> for the rest of the 3743.
>
> .tab sample
>
> sample	Freq.	Percent	Cum.
>
> 0		3,143	83.97		83.97
> 1		600	16.03		100.00
>
> I now generate the probability of inclusion using just age and
> geography to
> make things simple:
>
> xi: logistic sample i.age_grp i.geo_grp
>
> predict p_r
>
> *5:
> gen weight2 = weight1 * (1/p_r)
>
> *6:
>
> Now I generate the totals for age and geography:
>
> *age
> gen pct_agex = .
> replace pct_agex = 450 / 4975 if age_grp == 1
> replace pct_agex = 438 / 4975 if age_grp == 2
> replace pct_agex = 395 / 4975 if age_grp == 3
> replace pct_agex = 375 / 4975 if age_grp == 4
> replace pct_agex = 376 / 4975 if age_grp == 5
> replace pct_agex = 370 / 4975 if age_grp == 6
> replace pct_agex = 344 / 4975 if age_grp == 7
> replace pct_agex = 315 / 4975 if age_grp == 8
> replace pct_agex = 306 / 4975 if age_grp == 9
> replace pct_agex = 299 / 4975 if age_grp == 10
> replace pct_agex = 275 / 4975 if age_grp == 11
> replace pct_agex = 271 / 4975 if age_grp == 12
> replace pct_agex = 263 / 4975 if age_grp == 13
> replace pct_agex = 241 / 4975 if age_grp == 14
> replace pct_agex = 257 / 4975 if age_grp == 15
>
> gen tot_agex = round(pct_agex * 10000)
>
> replace tot_agex = tot_agex - 1 if agex ==1
>
> *Geography
> gen pct_geo =.
> replace pct_geo = 2726 / 4975 if geo_gr==1
> replace pct_geo = 2249 / 4975 if geo_gr==2
>
> gen tot_geo = round(pct_geo * 10000)
>
> * Now I rake weight2 back to the age categories & geographics
>
> keep if sample==1
>
> survwgt rake  weight2,   ///
>         by(age_grp  geo_grp) ///
>         totvars(tot_agex tot_geo) ///
>         gen(weight3)
>
>
> *7
>
> Here I make new variables for tot_agex and tot_grp from data from
> the Danish
> Census (_DC) like this:
>
> *age
> gen pct_agex_DC = .
> replace pct_agex_DC = (DC population total in age_grp==1) / (DC
> population
> total) if age_grp == 1
> .
> .
> .
> .
> replace pct_agex_DC = (DC population total in age_grp==15) / (DC
> population
> total) if age_grp == 15
> gen tot_agex_DC = round(pct_agex_DC * 10000)
>
> And the same for tot_geo_DC
>
> Then I use the rake again
>
> survwgt rake  weight3,   ///
>         by(age_grp  geo_grp) ///
>         totvars(tot_agex_DC tot_geo_DC) ///
>         gen(weight4)
>
> svyset  [pweight=weight4], strata(agex)
>
> So to estimate ed in the general population I would do:
>
> svymean ed
>
> Is it correct?
>
> Steven if you give me your personal details I'll include you in the
> acknowledgements of the paper if you'd like.
>
> Best regards
> Kristian
>
> -----Original message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Steven
> Samuels
> Sent: Monday, December 08, 2008 6:13 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: SV: S: SV: st: Survey - raking - calibration - post
> stratification
> - calculating weights
>
>
> On Dec 8, 2008, at 2:55 AM, Kristian Wraae wrote:
>
>> Ok, thanks.
>>
>> Now I understand how to do the raking procedure.
>>
>> I have one question though.
>>
>> Since I have a two step inclusion procedure wouldn't it be more
>> accurate to
>> rake in two steps.
>>
>> Example:
>> I know the distribution of medication amongst the 3745 men.
>>
>> But the 3745 men differ from the 4975 men by being slightly
>> younger and we
>> know that the older you get the more medicine you get. That also
>> goes for
>> physical activity and smoking.
>>
>> So if I calculate the expected prevalences amongst the 4975 (in
>> order to
>> rake the 600) from the 3750 I risk making a mistake
>> (underestimating the
>> prevalences in the background population). I guess I should be
>> calculating
>> all the prevalences from the 4975, but I don't have those data.
>>
>> So wouldn't it be more correct to:
>>
>> 1. Rake the 3750 so they match the 4975 on age and geography.
>>
>> 2. Calculate all the expected prevalences on age, medication,
>> smoking,
>> physical activity etc. from the now raked 3750 (as we would expect
>> them to be
>>
>> 3. Use these prevalences to rake the 600 as you showed me?
>
>
> Your concern is a good one, Kristian.  However, the solution you
> propose is ad-hoc with no real theoretical justification. I've tried
> some complicated raking in the past, but I have never seen a
> reference to the method you propose. You have much questionnaire
> information on too many informative variables; raking can use only a
> small part of it.  There is a standard approach to this problem:
> model the probability of participating in the phone interview. I
> suggest you consult the text "Statistical Analysis with Missing Data"
> by Little & Rubin, especially Chapters 3 & 13.  In the parlance of
> that book, you must assume that data are "Missing at Random". This
> means that the probability of having a phone interview depends
> completely on characteristics known from the mail questionnaire or
> the census.
>
> Here are the steps:
>
> 1. Estimate weight1 = N_i/n_i  as before for the 15 age groups.
>
> 2. You can use this weight on the second phase sample of 3,750 to
> estimate various properties of the population known such as
> proportions in categories of medication, physical activity, and smoking.
> These may be of interest in themselves.
>
> 3. Instead of raking, use -logistic- or -logit- (not the survey
> versions)  on the 3,750 men to predict who participated in the
> telephone interview.  Consider as covariates: age, geography,
> medication, physical activity, smoking and any others that might be
> of use.
>
> 4. Generate the predicted probability of participating in the
> telephone interview.   Call this p_r.  Your goal is to get a good
> prediction, so compute ROC curves, if possible.  (I don't recall if
> Stata 8 has the -lroc- command.)
>
> 5. For the 600 men in the telephone survey, compute:  weight2 =
> (weight1) x (1/p_r).
>
> 6.  Rake weight2 back to the age categories & geographic categories
> of  the 5,000 men.  Call the result "weight3".
>
> 7. Finally rake weight3 to the Danish Census age/geographical
> breakdowns: Call it "weight4".
>
> 8. Use this as your final analysis weight for -svymean-.
>
> You are a long way from the simplicity of Stas's earlier suggestion
> to use "weight1" on your data.  Standard errors that you compute will
> be under-estimated, because they do not account for the uncertainty
> in estimating "weight3", and you must state this in your report.
> If you wish to compute the proper standard errors, you must, I think,
> bootstrap the process starting no later than Step 3.  This is the
> price for using the complex sampling design.
>
> -Steve
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
```
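
The weighting steps discussed in the thread (initial weights N_i/n_i, raking to age and geography totals, dividing by a predicted participation probability, then raking again) can be sketched outside Stata. What follows is a minimal illustration in plain Python, not the implementation behind `survwgt rake`. The function names, the 2 x 2 toy sample, the population totals of 600/400 and 550/450, and the response probabilities are all assumptions made up for the example; only the age-group counts 450/346 and 257/189 come from the tables above. For brevity the toy keeps every row after the probability adjustment, whereas the thread restricts that step to the 600 interviewed men.

```python
from collections import defaultdict


def base_weights(pop_counts, resp_counts):
    """Initial weight N_i / n_i for each group i."""
    return {g: pop_counts[g] / resp_counts[g] for g in pop_counts}


def margin_totals(rows, weights, var):
    """Sum of weights within each category of `var`."""
    totals = defaultdict(float)
    for row, w in zip(rows, weights):
        totals[row[var]] += w
    return dict(totals)


def rake(rows, weights, margins, max_iter=200, tol=1e-10):
    """Iterative proportional fitting: rescale the weights one margin at a
    time until every weighted margin matches its target total."""
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            totals = margin_totals(rows, w, var)
            for i, row in enumerate(rows):
                factor = targets[row[var]] / totals[row[var]]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return w


# Age groups 1 and 15 from the tables in the thread (4975 invited, 3743 replied).
w1 = base_weights({1: 450, 15: 257}, {1: 346, 15: 189})

# Hypothetical toy sample: 100 respondents in 2 age x 2 geography cells.
cells = {(1, 1): 30, (1, 2): 20, (2, 1): 10, (2, 2): 40}
rows = [{"age_grp": a, "geo_grp": g} for (a, g), n in cells.items() for _ in range(n)]

# Hypothetical population totals; both margins must sum to the same N (1000).
margins = {"age_grp": {1: 600, 2: 400}, "geo_grp": {1: 550, 2: 450}}

# Start from N_i / n_i by age group, then rake to both margins.
n_by_age = margin_totals(rows, [1.0] * len(rows), "age_grp")
start = base_weights(margins["age_grp"], n_by_age)
weight2 = rake(rows, [start[r["age_grp"]] for r in rows], margins)

# Hypothetical response probabilities p_r (in the thread these come from a
# logistic model); divide by them, then rake back to the same totals.
p_r = [0.2 if r["geo_grp"] == 1 else 0.1 for r in rows]
weight3 = rake(rows, [w / p for w, p in zip(weight2, p_r)], margins)

print(margin_totals(rows, weight3, "age_grp"))
print(margin_totals(rows, weight3, "geo_grp"))
```

Raking only converges to the targets when every margin sums to the same population total, which is why the thread talks about supplying consistent totals for age and geography. It says nothing about standard errors, which the thread notes are under-estimated unless the whole procedure is bootstrapped.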