Statalist



Re: SV: SV: SV: SV: st: Survey - raking - calibration - post stratification - calculating weights


From   Steven Samuels <sjhsamuels@earthlink.net>
To   statalist@hsphsun2.harvard.edu
Subject   Re: SV: SV: SV: SV: st: Survey - raking - calibration - post stratification - calculating weights
Date   Tue, 9 Dec 2008 14:09:04 -0500

--

You could keep Danish Caucasians as the "target population" (a formal term). For the questionnaire it is also the "sampled population" (also a formal term), but for the phone/exam the sampled population excludes the distant islands, which hold about 3% of the target population. This percentage is so small that excluding them will probably make no discernible difference in the estimates for the target population.

I would not necessarily ignore location. There are many examples of geographical differences in health (urban/rural; mountains/seaside; North/South; hot/cold). So larger geographical groupings could also be analytic categories. However, such categories should not be based on distance from the study site.

Sixty-seven zip codes is too many categories. They will present problems for the raking algorithm and could lead to larger standard errors when you rake to the Danish census. You will need to combine them then; so, why not now?
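
A minimal Stata sketch of combining the zip-code groups before raking. It assumes geo_grp is coded 1 to 67; the cut points and the new variable names geo7 and n_region are made up for illustration. Real groupings should follow meaningful regions (urban/rural, for example), not distance from the study site. The toy data only stand in for the 600-man sample, and runiform() assumes a newer Stata than version 8.

```stata
* Toy data standing in for the 600-man sample (assumption, not the real file)
clear
set seed 2008
set obs 600
gen geo_grp = ceil(67*runiform())      // hypothetical coding 1..67

* Collapse 67 groups into 7 broader regions (cut points are hypothetical)
recode geo_grp (1/10=1) (11/20=2) (21/30=3) (31/40=4) ///
               (41/50=5) (51/60=6) (61/67=7), generate(geo7)

* Inspect how many respondents fall in each combined region
tabulate geo7
bysort geo7: gen n_region = _N
summarize n_region
```

Regions that still hold only a handful of respondents are candidates for further merging before they go into by().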


Be sure to read the Abt publication on raking that I referred to. There is another on weighting and choosing variables for post-stratification: http://www.abtassociates.com/presentations/AAPOR06_Poststratification.pdf


-Steve

On Dec 9, 2008, at 12:10 PM, Kristian Wraae wrote:

None, since they were not asked to participate.

So we have no samples from that part of the country amongst the 600.

Actually it gets a little more complicated than that, since we also
excluded a few people due to ethnicity.

Our project deals in part with genetics, so only Caucasians were
included in the final 600.

The Danish population (especially in the age group we are looking at) is
very homogeneous in the first place, so not that many were excluded.

But I guess we should drop distant islands and non-Caucasians from the 4975
and 3743.

That will of course affect the validity of the estimated prevalences. They will
no longer describe distributions in the Danish population but rather in Danish
Caucasians.





-----Original message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Steven Samuels
Sent: Tuesday, December 09, 2008 5:55 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: SV: SV: SV: st: Survey - raking - calibration - post
stratification - calculating weights



Kristian, How many men in the 600 person sample were from the distant
islands?


On Dec 9, 2008, at 11:25 AM, Kristian Wraae wrote:

Thanks Steve

I was referring to the design.
What I meant by same probability was that if I include the distant islands
in the most distant non-island categories, those categories would be weighted
too high, because too few of the 600 would be from that category.


I think the only good solution is to drop the 164 men from the 4975 and
rake on geography with fewer categories.

included in the final sample". People had different probabilities of
getting into that sample; that is why you are doing the response-
modeling.


-Steve

On Dec 9, 2008, at 10:23 AM, Kristian Wraae wrote:

I think the reason why Stata complains about totals not being equal is that
I have one geography category missing amongst the 600. To avoid having too many
dropouts, we refrained from asking people who lived on distant islands, and thus
had difficulty showing up, to participate in the final sample.

So I suppose we should drop all individuals living on islands amongst the
4975 (it is only 164) and later amongst the 3743 (120) in order to do the
final raking with geography.

Alternatively the final raking should be done without geography, since there
is really no reason to believe that geography should be a factor determining
health.



Another approach is to include the islands in the most distant zip-code
category, but that will interfere with the assumption that all had the same
probability of being included in the final sample.

You misunderstand the purpose of raking.  There is no such assumption
involved.

My best suggestion would be not to rake on geography in the last two steps
(or maybe not at all).

Age is definitely the most important variable to rake on.




-----Original message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Kristian Wraae
Sent: Tuesday, December 09, 2008 1:23 PM
To: statalist@hsphsun2.harvard.edu
Subject: SV: SV: st: Survey - raking - calibration - post stratification -
calculating weights


Now I have continued to step 2 with this do file:

*Step 2

xi: logistic sample i.age_grp i.geo_grp  i.health_medication ///
        i.health_diseases

predict p_r

gen weight3x = weight2x * (1/p_r)

keep if sample == 1
				*(reducing dataset to 600 men)
survwgt rake  weight3x,   ///
        by(age_grp  geo_grp) ///
        totvars(tot_age_grp tot_geo_grp) ///
        gen(weight4x)



The problem now is that Stata says that "totals across dimensions 1 and 2
are not equal".

Why is that? Should I generate new totals for tot_age_grp and tot_geo_grp?
Should they be based on the 3743? Why?
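
One way to look into the "totals across dimensions" message is to add up each control total once per category that is still present in the data and compare the sums. The sketch below uses toy data with one geography category left out of the subsample, mimicking the islands. The helper variables first_age and first_geo and the scalars are hypothetical names, and this is a guess at what the message is checking, not a description of survwgt's internals.

```stata
* Toy population: 3 age groups, 4 geography groups (illustrative only)
clear
set seed 42
set obs 2000
gen pop = 1
gen age_grp = ceil(3*runiform())
gen geo_grp = ceil(4*runiform())
egen tot_age_grp = count(pop), by(age_grp)
egen tot_geo_grp = count(pop), by(geo_grp)

* Subsample that never includes geo_grp 4 (stands in for the islands)
gen byte sample = runiform() < 0.3 & geo_grp != 4
keep if sample == 1

* Sum each control total once per category present in the subsample
egen byte first_age = tag(age_grp)
egen byte first_geo = tag(geo_grp)
quietly summarize tot_age_grp if first_age
scalar sum_age = r(sum)
quietly summarize tot_geo_grp if first_geo
scalar sum_geo = r(sum)
display "sum of age totals = " sum_age "   sum of geo totals = " sum_geo
```

In this toy setup the age totals still add up to the full 2000, while the geography totals fall short by the size of the excluded category. The two sums agree again once the excluded group is dropped before the totals are computed, or once it is merged into another category.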

How do I deal with missing values in p_r? (Depending on which predictors I
include in the logistic regression, I might get missing values for p_r.)
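
A small sketch of where missing p_r comes from and one possible workaround. predict returns missing whenever a predictor is missing, so counting the missing values per predictor shows the source. Giving missing values their own category (code 9 in the hypothetical variable med_cat) is only one option and is a judgment call, not a recommendation from this thread. The data are simulated.

```stata
* Simulated data: one predictor has about 10% missing values
clear
set seed 7
set obs 500
gen age_grp = ceil(3*runiform())
gen byte health_medication = runiform() < 0.4
replace health_medication = . if runiform() < 0.1
gen byte sample = runiform() < 0.3

* Response model as in step 2; p_r is missing where a predictor is missing
xi: logistic sample i.age_grp i.health_medication
predict p_r
count if missing(p_r)
count if missing(health_medication)

* Possible workaround: treat "missing" as its own category (code 9)
gen byte med_cat = health_medication
replace med_cat = 9 if missing(health_medication)
xi: logistic sample i.age_grp i.med_cat
predict p_r2
count if missing(p_r2)
```

The first two counts should match, and the last should be zero. Whether a separate "missing" category is acceptable depends on why the values are missing.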



-----Original message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Kristian Wraae
Sent: Tuesday, December 09, 2008 12:35 PM
To: statalist@hsphsun2.harvard.edu
Subject: SV: SV: st: Survey - raking - calibration - post stratification -
calculating weights


I have now tried to do the first step of the raking.

I have 15 age groups and 67 geographic groups (simply based on the zip
codes).

I tried to do the raking first with a smaller number of geographic groups
(10) but the results were more accurate with all groups.

The variables I have are:
age = continuous variable containing the age of the subject at the time of sampling
dist_study = continuous variable containing the distance from the individual to me
age_grp = categorical variable - 15 age strata
geo_grp = zip code
quest = 1 if individual returned a filled out questionnaire
pop = 1 if individual was amongst the 4975 in the original sample (all had of course pop=1)
sample = 1 for each finally included subject

The do file looks like this:

*************
*To get data from the original population
tabstat age
tabstat dist_study

*Raking starts by generating totals in each age group and geographical group
egen tot_age_grp =  count(pop),by(age_grp)
egen tot_age_grp_q = count(pop) if quest==1, by(age_grp)

egen tot_geo_grp =  count(pop),by(geo_grp)
egen tot_geo_grp_q = count(pop) if quest==1, by(geo_grp)
*Initial weight is generated
gen weight1x = (tot_age_grp / tot_age_grp_q)

keep if quest==1
			*(reducing the dataset to 3743 men)
survwgt rake  weight1x,   ///
        by(age_grp  geo_grp) ///
        totvars(tot_age_grp tot_geo_grp) ///
        gen(weight2x)

svyset  [pweight=weight2x], strata(age_grp)

*Description
svydes
*Now we estimate the average age in the 4975 men from the 3743 men
svymean age
*Now we estimate the average distance to travel to get to me for the 4975 men
*based on the 3743 men
svymean  dist_study

*These are the actual numbers for the 3743 men.
tabstat age
tabstat dist_study
******************
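
To see what the raking step in the do-file above is doing, here is a rough sketch of iterative proportional fitting written with built-in commands only. It is not survwgt's implementation. The data are simulated, the variable names w, wsum, chk_*, and err_* are hypothetical, the fixed 50 passes are an arbitrary choice, and egen total() and runiform() assume a newer Stata than version 8.

```stata
* Simulated population with known margins (illustrative only)
clear
set seed 12345
set obs 1000
gen pop = 1
gen age_grp = ceil(3*runiform())
gen geo_grp = ceil(4*runiform())
egen tot_age_grp = count(pop), by(age_grp)
egen tot_geo_grp = count(pop), by(geo_grp)

* Keep a random 75% as "respondents"
gen byte quest = runiform() < 0.75
keep if quest == 1

* Raking: alternately scale weights to match each margin
gen double w = 1
forvalues iter = 1/50 {
    foreach v in age_grp geo_grp {
        egen double wsum = total(w), by(`v')
        quietly replace w = w * tot_`v' / wsum
        drop wsum
    }
}

* Check: weighted counts per category should reproduce the control totals
egen double chk_age = total(w), by(age_grp)
egen double chk_geo = total(w), by(geo_grp)
gen double err_age = reldif(chk_age, tot_age_grp)
gen double err_geo = reldif(chk_geo, tot_geo_grp)
summarize err_age err_geo
```

The same kind of check can be run on weight2x after survwgt rake, by comparing the weighted count in each age and geography category with tot_age_grp and tot_geo_grp.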

The output from Stata 8 is:

. *************
. tabstat age

    variable |      mean
-------------+----------
         age |   66.6695
------------------------

. tabstat dist_study

    variable |      mean
-------------+----------
  dist_study |  25.90153
------------------------

.
.
. egen tot_age_grp =  count(pop),by(age_grp)

. egen tot_age_grp_q = count(pop) if quest==1, by(age_grp)
(1232 missing values generated)

.
. egen tot_geo_grp =  count(pop),by(geo_grp)

. egen tot_geo_grp_q = count(pop) if quest==1, by(geo_grp)
(1232 missing values generated)

.
. gen weight1x = (tot_age_grp / tot_age_grp_q)
(1232 missing values generated)

.
. keep if quest==1
(1232 observations deleted)

.                         *(reducing the dataset to 3743 men)
. survwgt rake  weight1x,   ///
        by(age_grp  geo_grp) ///
        totvars(tot_age_grp tot_geo_grp) ///
        gen(weight2x)

.
. svyset  [pweight=weight2x], strata(age_grp)
pweight is weight2x
strata is age_grp

.
. svydes

pweight:  weight2x
Strata:   age_grp
PSU:      <observations>
                                      #Obs per PSU
 Strata                       ----------------------------
 age_grp    #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       346       346         1       1.0         1
       2       333       333         1       1.0         1
       3       304       304         1       1.0         1
       4       297       297         1       1.0         1
       5       284       284         1       1.0         1
       6       275       275         1       1.0         1
       7       249       249         1       1.0         1
       8       246       246         1       1.0         1
       9       231       231         1       1.0         1
      10       209       209         1       1.0         1
      11       212       212         1       1.0         1
      12       210       210         1       1.0         1
      13       184       184         1       1.0         1
      14       174       174         1       1.0         1
      15       189       189         1       1.0         1
--------  --------  --------  --------  --------  --------
      15      3743      3743         1       1.0         1

.
. svymean  age

Survey mean estimation

pweight:  weight2x                                Number of obs    =      3743
Strata:   age_grp                                 Number of strata =        15
PSU:      <observations>                          Number of PSUs   =      3743
                                                  Population size  =      4975

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
     age |   66.66605    .0067455    66.65283    66.67928    .0092211
------------------------------------------------------------------------------

. svymean  dist_study

Survey mean estimation

pweight:  weight2x                                Number of obs    =      3742
Strata:   age_grp                                 Number of strata =        15
PSU:      <observations>                          Number of PSUs   =      3742
                                                  Population size  = 4973.7235

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
dist_s~y |   25.90772    .3139459     25.2922    26.52325     1.01731
------------------------------------------------------------------------------

.
. tabstat age

    variable |      mean
-------------+----------
         age |   66.5895
------------------------

. tabstat dist_study

    variable |      mean
-------------+----------
  dist_study |  25.93867
------------------------

.
end of do-file

As one can see, the average age amongst the 4975 men is: 66.6695

Using raking and svymean, Stata estimates the average age amongst the 4975
men based on the information from the 3743 men to be: 66.66605

As one can see, those are quite similar.

Now let us look at the distance to travel. We raked on zip codes, which are
not equivalent to distances, but despite that the results are quite amazing:

We know the average distance to travel is: 25.90153 km

After raking and basing the results on the 3743 men, Stata estimates the
distance to be: 25.90772 km

Strikingly similar. The unweighted means amongst the 3743 are not as
close: 66.5895 years and 25.93867 km, but really not that far off.

The differences will be far greater when raking the 600.

I will now go on.


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


