Statalist



SV: SV: st: Survey - raking - calibration - post stratification - calculating weights


From   "Kristian Wraae" <Kristian_Wraae@vip.cybercity.dk>
To   <statalist@hsphsun2.harvard.edu>
Subject   SV: SV: st: Survey - raking - calibration - post stratification - calculating weights
Date   Sun, 7 Dec 2008 21:33:07 +0100

OK, thanks.

So weight1 will be defined as:

age_grp	n_age_grp_s	pct_age_grp_s
1		38		6.33
2		47		7.83
3		41		6.83
4		41		6.83
5		44		7.33
6		38		6.33
7		44		7.33
8		48		8.00
9		43		7.17
10		41		6.83
11		42		7.00
12		35		5.83
13		39		6.50
14		33		5.50
15		26		4.33
Total 	600

gen weight1 = n_age_grp/n_age_grp_s, with n_age_grp taken from the table
below.
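
A minimal sketch of that step, assuming the data in memory hold only the 600 sampled men (one row each) and that the population count n_age_grp has already been merged in by age_grp. The variable names follow the tables in this thread; the tolerance in the check is arbitrary.

```stata
* Sketch only: 600 sampled men in memory, one row per man,
* with age_grp and the population count n_age_grp present.
bysort age_grp: gen n_age_grp_s = _N      // sample size n in each age group
gen weight1 = n_age_grp / n_age_grp_s     // N/n, e.g. 450/38 for age group 1

* Within an age group the weights sum to N, so overall they
* should sum to roughly 4975 (up to float rounding).
quietly summarize weight1
display "Sum of weight1 = " r(sum)
```

If the displayed sum is far from 4975, the merge of n_age_grp is the first thing to check.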

I don't understand this: "svy: mean ed"

Neither does Stata. 

I'm using Stata 8.

There is a command called svymean. Isn't that the correct command?

Regarding missing values: I have none amongst the 600, since
I asked them when they were interviewed later on in the study.

Best regards
Kristian 

-----Original message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Steven Samuels
Sent: Sunday, December 07, 2008 8:56 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: SV: st: Survey - raking - calibration - post stratification -
calculating weights


On Dec 7, 2008, at 2:18 PM, Kristian Wraae wrote:
> I'll use the 4975 as the population):
>
> weight1 = 10000 / 4975
>
No:  The advice was to use a different weight in each age group.   
weight1 = N/n where the N is in your list below and the "n" the  
corresponding number in the sample of 600 men.  So, for example if,  
in age group 1 there were 30 men in the sample of 600, then weight1  
would have the value (450/30) = 15 for that group.  The population  
size "10,000" is purely a  convenient fiction so that the total  
population counts will match for all the control variables (age and  
smoking, here).  It's okay to use such a fiction because for  
estimating means and proportions, the population totals do not matter.
>
> I will here keep the 15 age strata just to make it simple.
>
> The 4975 men were distributed like this:
>
> age_grp	n_age_grp	pct_age_grp	Cum.
> 1		450		9.05		9.05
> 2		438		8.80		17.85
> 3		395		7.94		25.79
> 4		375		7.54		33.33
> 5		376		7.56		40.88
> 6		370		7.44		48.32
> 7		344		6.91		55.24
> 8		315		6.33		61.57
> 9		306		6.15		67.72
> 10		299		6.01		73.73
> 11		275		5.53		79.26
> 12		271		5.45		84.70
> 13		263		5.29		89.99
> 14		241		4.84		94.83
> 15		257		5.17		100.00
> Total		4975
>
> So I create a variable called pct_age_grp = n_age_grp / 4975 and get the
> values above.
>
> Now I create the variable
>
> 	gen tot_age_grp = round(pct_age_grp * 10000)
>

> It looks like this
>
> age_grp	tot_age_grp
> 1		905
> 2		880
> 3		794
> 4		754
> 5		756
> 6		744
> 7		691
> 8		633
> 9		615
> 10		601
> 11		553
> 12		545
> 13		529
> 14		484
> 15		517
> Total		10001
>
> So I subtract one from group 1 :
>
> 	replace tot_age_grp = tot_age_grp - 1 if age_grp == 1
>
> Now the total is 10000.
YES
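
The rounding and adjustment just described can be sketched as follows. This is an illustration only: it assumes a small data set with one row per age group, with the counts typed in from the table of 4975 men above, and it moves the whole rounding excess to the single largest group, which is one simple choice among several.

```stata
* Sketch only: one row per age group, counts from the table of 4975 men.
clear
input age_grp n_age_grp
 1 450
 2 438
 3 395
 4 375
 5 376
 6 370
 7 344
 8 315
 9 306
10 299
11 275
12 271
13 263
14 241
15 257
end

* Scale each group's share of the 4975 men to a nominal population of 10,000.
quietly summarize n_age_grp
scalar N_total = r(sum)
gen tot_age_grp = round(n_age_grp / scalar(N_total) * 10000)

* Rounding can leave the total slightly off 10,000;
* here the excess is taken from the largest group.
quietly summarize tot_age_grp
scalar excess = r(sum) - 10000
gsort -n_age_grp
replace tot_age_grp = tot_age_grp - scalar(excess) in 1
sort age_grp
list age_grp n_age_grp tot_age_grp, clean
```

The resulting tot_age_grp can then be merged into the 600 man data set by age_grp.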

>
> For smoking I have a variable based on packyear which has three
> categories.
>
> smoke_grp	n_smoke_grp	pct_smoke_grp
> 1		801		23.45
> 2		1,272		37.24	
> 3		1,343		39.31	
> Total		3416
>
> So I generate tot_smoke_grp = round(pct_smoke_grp * 10000)
>
> The totals are:
>
> smoke_grp	tot_smoke_grp
> 1		2345
> 2		3724
> 3		3931
> Total		10000
>
> So the total is 10000. No need to do more here.
>


> My dataset contains all 4975 men. I assume that I should drop all 
> observations except the ones for 600 men before running the survwgt ?
>
Yes, this is key. If your outcomes are only defined in the 600 man
sample, then that is the analysis data set.  I see above that you
have missing values for smoking among the 3,750 men who responded to
the second phase of the survey. You have a potential problem if any
of the 600 men in the final sample are also missing the smoking
variable.  For the purposes of raking you must assign these men to
one of the three smoking categories. I suggest that you assign them
as close to the known proportions as possible: about 23% to group 1,
37% to group 2, and 39% to group 3.
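
One possible way to do such an assignment is a random draw in the control proportions. This is only a sketch under assumptions: the 600 man data set is in memory, smoke_grp is the three-category variable, the seed is arbitrary, and a random draw only approximates the target proportions when few men are affected. runiform() needs a recent Stata; older releases call the same function uniform().

```stata
* Sketch only: fill in missing smoke_grp at random, using the
* proportions from the 3,416 men with known smoking status.
set seed 20081207                        // arbitrary, for reproducibility
gen double u = runiform() if missing(smoke_grp)
replace smoke_grp = cond(u < 0.2345, 1, ///
                    cond(u < 0.2345 + 0.3724, 2, 3)) if missing(smoke_grp)
drop u
tabulate smoke_grp, missing              // confirm that no missing values remain
```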

> 	keep if sample == 1 (the 600 men all had sample == 1 and everybody 
> else has sample == 0)
>
> So now I run the survwgt command:
>
> survwgt rake  weight1,   ///
>         by(age_grp smoke_grp) ///
>         totvars(tot_age_grp tot_smoke_grp) ///
>         gen(weight2)

YES

>
> Suppose I have a binary variable called ed amongst the 600 men, and the
> distribution amongst these men is that 23% have ed == 1 and the rest
> have ed == 0.
>
> How do I estimate ed amongst the 10,000 men?
>
> Is it:
>
> svymean ed
>
No, the command is "svy: mean ed".  You must first -svyset- your
data, as I outlined below.  Don't use any finite population
correction ("fpc") options.
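
A minimal sketch of those two steps, assuming the raked weight is weight2 and the 15-category age variable is age_grp. Note that the svy: prefix was introduced in Stata 9. The Stata 8 lines are given as comments from memory of that release's syntax, so check -help svyset- before relying on them.

```stata
* Stata 9 or later: declare the design, then use the svy: prefix.
* No fpc() option is given, as advised above.
svyset _n [pweight=weight2], strata(age_grp)
svy: mean ed                 // for a 0/1 variable this is the estimated proportion

* Stata 8 (assumed syntax): the design is given as options to svyset,
* and svymean takes the place of svy: mean.
* svyset [pweight=weight2], strata(age_grp)
* svymean ed
```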

At this point you should familiarize yourself with the different
commands available.  Try -help svy- and, from there, the entry for -svy
estimation-. Both the help and the manual have good examples.

Good luck

Steven

>
>
>
> -----Original message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Steven Samuels
> Sent: Sunday, December 07, 2008 5:02 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: SV: SV: st: Survey - raking - calibration - post stratification -
> calculating weights
>
>
>
>
> Kristian, raking on the two or more variables, with the totals coming
> from different populations, is easy.
>
> 1. Create the initial weight1 =N/n with "population" N and sample n
> in age groups as Stas and I suggested in the previous email.
>
> 2. Then, create categorized variables for age, medicin, and smoke. You
> will create counts for these categories (tot_age, tot_medicin,
> tot_smoke) from the control percentages, but with a "population size"
> of 10,000 across all.
>
> 2.1 Age:  These will be numbers based on percentages in the original
> 5,000 men, though it would be *much* better to base them on the
> Danish Census data. (If I were a journal reviewer, I would not accept
> a publication that did not do this unless there was a very good
> reason.) The data source (5,000 men or census) is known as the
> "external" or "control" population for age.
>
> I would suggest you create a variable with fewer than 15 categories,
> as too many categories can prevent the raking algorithm from working.
> I will call the variable agex
>
> You must compute the percentages of observations in each category of
> agex externally and merge them into the 600 man data set.
>
> For example, suppose that in the control population, the first few
> categories of agex have the following percentages
>
> agex    pct_agex     tot_agex (= 100 x pct_agex, rounded to nearest 1)
>
> 1         8.23          823
> 2        10.41         1041
> etc.
>
>
> Total   100.00        10,000
> Important: If the totals do not add to 10,000 then adjust the counts
> of the largest few categories so they do.
>
> You can add tot_agex by hand to the  600 man data set, or create it
> externally and merge it in.
>
>
>
> 2.2. For medicin, do the same kind of categorization, but base the
> percentages on the 3,750 man data set.  Here I assume that medicin,
> has three categories.
>
> medicin    pct_medicin    tot_medicin
> 1           30.23         3023
> 2           45.86         4586
> 3           23.93         2393
> Total      100.02        10002
>
> The original totals must be adjusted so that they add up exactly to
> 10,000. In this case, for example I would subtract 1 from  totals for
> the largest two groups.  3023->3022 and 4586 ->4585
>
>
> 2.3.  You can also do the same with smoking: create smoke categories
> and tot_smoke as the totals in each, which add to 10,000 exactly.  In
> fact, if  the number of smoking and medicin combinations is small
> (say 3 x 3 = 9), you can create a combined variable, with the
> percentages in each.
>
> med_smok     pct_med_smok    tot_med_smok
> 1
> 2
> 3
> ..
> 9
>
> If you  do this, then you do not need the separate medicin adjustment
> and smoke margins.
>
> 3. Rake the three control variables (agex, medicin, smoke)
> simultaneously.
>
>
> **************************CODE BEGINS**************************
> survwgt rake  weight1,   ///
>         by(agex medicin smoke) ///
>         totvars(tot_agex tot_medicin tot_smoke) ///
>         gen(weight2)
> ***************************CODE ENDS***************************
>
> Or, with a combined med_smok margin:
>
> **************************CODE BEGINS**************************
> survwgt rake  weight1,   ///
>         by(agex med_smok) ///
>         totvars(tot_agex tot_med_smok) ///
>         gen(weight2)
> ***************************CODE ENDS***************************
>
> (Note the comma in the first line, which was missing from my previous
> post.)
>
> Rarely will you need more than the default 10 iterations in
> -survwgt rake-. If you do, the program will issue an error message.
> You can increase the number by adding a -maxrep- option at the end:
> e.g. "maxrep(100)"
>
> If the number of sample observations in any control cell (agex,
> medicin, smoke, or med_smok) is too small, then the program may
> not converge or will take a long time.  In that case, you will need
> to merge sparse adjacent categories.  Suppose, for example, that you
> start out with 9 med_smok combinations, but two of them have few
> observations among the 600 men in the final sample.  Then merge these
> into adjacent categories and create a new 7 category variable.
>
> 4. Finally: -svyset- your data and run Stata's survey programs:
>
> svyset _n [pweight=weight2], strata(age_gp)
>
> Here "age_gp" is your original age variable with 15 categories.  You
> can probably omit the strata option at no loss. Be sure that if you
> want estimates for subpopulations, you do use the -subpop- option and
> not an "if" option.
>
> -Steven
>
>
>
> On Dec 7, 2008, at 4:52 AM, Kristian Wraae wrote:
>
>> Thanks Stas & Steven
>>
>> What I would like to do is to calibrate on some of the measures from
>> the first questionnaire.
>>
>> I have data on 3750 men from that first questionnaire, and I would like
>> to transform my 600 man population into my 5000 man population, so that
>> the distribution of chronic diseases and medication is the same as we
>> would expect it to be in the 5000 man population.
>>
>> I know how the 5000 men differ from the 3750 men regarding age and
>> geography. There was a slight effect of age, but geography was not
>> important for non-responders. So adjusting for age is really the only
>> thing needed at this step.
>>
>> Then I know how the 600 differ from the 3750 men. The 600 are better
>> educated, smoke less, do more exercise, are slightly less prone to
>> have chronic diseases, and are slightly younger.
>>
>> So I'd like to weight each of the 600 men so that I can compensate for
>> education, smoking, physical activity, chronic diseases (and medication,
>> but they are closely related, so I think I'll just adjust for medication
>> as it is the most precise measure) and age.
>>
>> So if I want to adjust for those, how do I go about that?
>>
>> I can see that the code below will adjust on age and geography, since
>> those data are present through the two steps, but the more detailed
>> information on smoking, health and lifestyle is only present in step two.
>>
>> I don't know the tot_medgb (medicin) or tot_smokegp (smoking) amongst
>> the 5000 but only amongst the 3750.
>>
>> That is, how do I incorporate the two steps into the raking? Or should
>> I use the post stratification command instead, since I know these data
>> on the individual level?
>>
>> As I see it, running two rakings after each other, one for step 1 and
>> one for step 2, would risk changing what has been done in the first
>> raking.
>>
>> I might be stupid but I don't really see how I can do this using the
>> code below.
>>
>> Also, how many variables is it advisable to rake on?
>>
>> Thank you for your help
>> Kristian
>>
>>
>>
>> -----Original message-----
>> From: owner-statalist@hsphsun2.harvard.edu
>> [mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Steven Samuels
>> Sent: Sunday, December 07, 2008 6:43 AM
>> To: statalist@hsphsun2.harvard.edu
>> Subject: Re: SV: st: Survey - raking - calibration - post stratification -
>> calculating weights
>>
>>
>> --
>>
>> Stas, I am envious of statisticians who draw samples from those
>> lists.  This is a double sample and I agree with your advice: give
>> everyone the weight for their age stratum:
>>                            weight1 = N_i/n_i
>> where "N" denotes population and "n" denotes sample size.  Kristian
>> apparently thinks of the 5,000 person sample as his "population"; the
>> figure that he linked to does not show the initial sampling step at
>> all. He may not have access to  the one-year census counts. If he
>> does not, I suggest that he use the N's from the 5,000.  I  suggest
>> below that he also form  geographic categories and rake those, with
>> population counts, if possible, otherwise with counts from the
>> 5,000.  I roughly calculate that with 5,000 in the first phase
>> sample, bias in estimates and in standard errors will be small.
>>
>> Kristian, here is how to simultaneously match the age distribution
>> and the geographic distribution of the final sample to your
>> population. (This is called "sample balancing" or "raking".)  Form
>> age groups (agegp) and geographical groupings (geogp) and get the
>> population counts(or percentages, see below) in each cell.
>>
>> **************************CODE BEGINS**************************
>> * tot_agegp = total for population in participant age group (agegp)
>> * tot_geogp = total for population in participant geographical group (geogp)
>> **************************************************************
>>
>> survwgt rake  weight1  ///
>>        by(agegp geogp) ///
>>        totvars(tot_agegp tot_geogp) ///
>>        gen(weight2)
>> ***************************CODE ENDS***************************
>>
>>
>> Raking can present problems, so I suggest that you read
>> http://www.abtassociates.com/presentations/raking_survey_data_2_JOS.pdf
>>
>> If you cannot get population counts, perhaps you can get population
>> percentages, multiply by 10 or 100 and round to the nearest whole number
>> (e.g. 5.12% = 51 or 512), so that the population "size" is 1,000 or 10,000.
>> For estimating means and proportions, these will yield nearly the
>> same results as actual population counts. The Denmark census counts
>> or percentages might be available only in larger age categories than
>> the ones you used to draw the sample: say (60-64, 65-69, 70-74). If
>> so, use those for the raking calculations.
>>
>> If you have, say, four geographical categories, you may be tempted to
>> use  4 x 15 =60 stratification combinations.  However, with only 600
>> people in the final sample, the numbers in individual cells will be
>> too small for reliable estimation.
>>
>> Theory for double sampling can be found in WG Cochran, 1973, Sampling
>> Techniques, pp 117-119, 327-334,  or in most other texts.
>> Unfortunately, raking will not completely solve the problem of non-
>> response.
>>
>> -Steven
>>
>> On Dec 6, 2008, at 11:19 PM, Stas Kolenikov wrote:
>>
>>> Steven,
>>>
>>> you might be shocked, but people in Nordic countries do have their
>>> population completely enumerated. Putting NJC's hat on :)), let me
>>> remind you that this is an international list, and different countries
>>> have different standards of how they collect and store their official
>>> data. Denmark has a register with an equivalent of SSN that makes it
>>> possible to combine the data three ways from economic, medical and
>>> social perspectives. That's a survey statistician's and a
>>> microeconometrician's dream... and they actually do have the capacity
>>> of drawing SRS. That is, the first 5000 were SRS of the population, and
>>> then Kristian continued with a stratified second phase sampling.
>>>
>>> I would probably just give everybody the weight = # in age group
>>> across Denmark (in some meaningfully defined period of the study) / #
>>> in age group in the sample. If you treat sample groups as
>>> non-response adjustment cells, that's what this will probably boil
>>> down to after multiplication of three or so fractions.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/