Re: SV: SV: st: Survey - raking - calibration - post stratification - calculating weights

From	Steven Samuels <[email protected]>
To	[email protected]
Subject	Re: SV: SV: st: Survey - raking - calibration - post stratification - calculating weights
Date	Tue, 9 Dec 2008 11:06:16 -0500

If you did not expect men from the distant islands in the final sample, then exclude them from that part of the analysis. If so, you can also redefine the target population for that analysis to be 60-74 year-old men everywhere except the distant islands. In the analysis of the Q, you can present results which include them. They are likely to be a very small fraction of the population and so.

Raking by geography is free for the first phase (describing variables for the 3,750 man data set). There are many examples of geographical differences in health: highlands, lowlands, seaside dwellers, so I would not drop the broad categories. They might even be analytic categories for your final analysis. You might have found 67 zip codes more 'accurate' as you said earlier, but that has no statistical meaning to me. In the last phase, raking to Danish figures, zip codes will be too fine a unit and are likely to cause problems and increase standard errors. So, group them now into meaningful categories.

There is no assumption "that all had the same probability of being included in the final sample". People had different probabilities of getting into that sample; that is why you are doing the response-modeling.
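
As a rough illustration of the response-modeling step, the sketch below simulates a small data set in which the chance of reaching the final sample differs by age group, fits a logistic model for `sample`, and weights each sampled man by the inverse of his predicted probability. The simulated data, the seed, the selection probabilities and the name `w_ipw` are assumptions made up for this example. The `i.` factor-variable notation and `runiform()` also assume a newer Stata than the version 8 used in this thread, where `xi:` and `uniform()` would be the equivalents.

```stata
* Hypothetical toy data: 3,000 men in three age groups
clear
set seed 20081209
set obs 3000
generate age_grp = 1 + mod(_n, 3)

* Assumed selection probabilities of 0.2, 0.3 and 0.4 by age group
generate sample = runiform() < (0.1 + 0.1*age_grp)

* Model the probability of being in the final sample
logistic sample i.age_grp
predict double p_r, pr

* Inverse-probability weight, defined only for the sampled men
generate double w_ipw = 1/p_r if sample == 1

* With age group as the only (saturated) predictor, the weights of
* the sampled men add up to the size of the full data set
quietly summarize w_ipw
scalar wsum = r(sum)
display "sum of weights = " wsum
assert abs(wsum - 3000) < 1
```

In the do-file quoted below, the same factor multiplies the weight from the previous step (`weight2x * (1/p_r)`). Note that `predict` returns a missing value for any observation with a missing predictor, so such men end up without a weight unless the predictors are complete.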



-Steve

On Dec 9, 2008, at 10:23 AM, Kristian Wraae wrote:

I think the reason why Stata complains about totals not being equal is that I have one geography category missing amongst the 600. We refrained from asking people who lived on distant islands, and thus had difficulty showing up, to participate in the final sample, to avoid having too many dropouts.
So I suppose we should drop all individuals living on islands amongst the 4975 (it is only 164) and later amongst the 3743 (120) in order to do the final raking with geography.
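
One way to check this diagnosis is to add up the control totals over the categories that are actually present in the data, once for each raking dimension, and compare the two sums. The sketch below uses made-up numbers: the age totals cover 200 men, but one geography category (3, with 40 men, standing in for the islands) has no observations, so the geography totals that remain add up to only 160. The data and the scalar and tag names are hypothetical; `tot_age_grp` and `tot_geo_grp` follow the do-file quoted further down.

```stata
* Hypothetical data after -keep if sample == 1-: geo_grp 3 never appears
clear
input age_grp geo_grp tot_age_grp tot_geo_grp
1 1 120 90
1 2 120 70
2 1  80 90
2 2  80 70
end

* Flag one observation per category so each control total is counted once
egen byte one_age = tag(age_grp)
egen byte one_geo = tag(geo_grp)

quietly summarize tot_age_grp if one_age
scalar sum_age = r(sum)
quietly summarize tot_geo_grp if one_geo
scalar sum_geo = r(sum)

* The two dimensions imply different population sizes
display "age totals sum to " sum_age ", geography totals sum to " sum_geo
assert sum_age == 200
assert sum_geo == 160
```

If the two sums differ, the unrepresented category either has to be left out of the control totals too (which redefines the target population) or merged into a category that does have observations.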

Alternatively the final raking should be done without geography since there is really no reason to believe that geography should be a factor determining health.

Another approach is to include the islands in the most distant zip-code category, but that will interfere with the assumption that all had the same probability of being included in the final sample.

You misunderstand the purpose of raking. There is no such assumption involved.

My best suggestion would be not to rake on geography in the last two steps (or maybe at all).

Age is definitely the most important variable to rake on.



-----Original message-----
From: [email protected] [mailto:[email protected]] On behalf of Kristian Wraae
Sent: Tuesday, December 09, 2008 1:23 PM
To: [email protected]
Subject: SV: SV: st: Survey - raking - calibration - post stratification - calculating weights


Now I have continued to step 2 with this do file:

*Step 2

xi: logistic sample i.age_grp i.geo_grp i.health_medication ///
        i.health_diseases

predict p_r

gen weight3x = weight2x * (1/p_r)

keep if sample == 1
				*(reducing dataset to 600 men)
survwgt rake  weight3x,   ///
        by(age_grp  geo_grp) ///
        totvars(tot_age_grp tot_geo_grp) ///
        gen(weight4x)

The problem now is that Stata says that "totals across dimensions 1 and 2 are not equal".

Why is that? Should I generate new totals for tot_age_grp and tot_geo_grp?

Should they be based on the 3743? Why?

How do I deal with missing values in p_r (depending on which predictors I include in the logistic regression I might get missing values for p_r)?




-----Original message-----
From: [email protected] [mailto:[email protected]] On behalf of Kristian Wraae
Sent: Tuesday, December 09, 2008 12:35 PM
To: [email protected]
Subject: SV: SV: st: Survey - raking - calibration - post stratification - calculating weights


I have now tried to do the first step of the raking.

I have 15 age groups and 67 geographic groups (simply based on the zip codes).

I tried to do the raking first with a smaller number of geographic groups (10) but the results were more accurate with all groups.

The variables I have are:

age = continuous variable containing the age of the subject at the time of sampling
dist_study = continuous variable containing the distance from the individual to me
age_grp = categorical variable - 15 age strata
geo_grp = zip code
quest = 1 if individual returned a filled-out questionnaire
pop = 1 if individual was amongst the 4975 in the original sample (all had of course pop=1)
sample = 1 for each finally included subject

The do file looks like this:

*************
*To get data from the original population
tabstat age
tabstat dist_study

*Raking starts by generating totals in each age group and geographical group
egen tot_age_grp =  count(pop),by(age_grp)
egen tot_age_grp_q = count(pop) if quest==1, by(age_grp)

egen tot_geo_grp =  count(pop),by(geo_grp)
egen tot_geo_grp_q = count(pop) if quest==1, by(geo_grp)

*Initial weight is generated
gen weight1x = (tot_age_grp / tot_age_grp_q)

keep if quest==1
			*(reducing the dataset to 3743 men)
survwgt rake  weight1x,   ///
        by(age_grp  geo_grp) ///
        totvars(tot_age_grp tot_geo_grp) ///
        gen(weight2x)

svyset  [pweight=weight2x], strata(age_grp)

*Description
svydes

*Now we estimate the average age in the 4975 men from the 3743 men
svymean  age
*Now we estimate the average distance to travel to get to me for the
*4975 men based on the 3743 men
svymean  dist_study

*These are the actual numbers for the 3743 men.
tabstat age
tabstat dist_study
******************

The output from Stata 8 is:

. *************
. tabstat age

    variable |      mean
-------------+----------
         age |   66.6695
------------------------

. tabstat dist_study

    variable |      mean
-------------+----------
  dist_study |  25.90153
------------------------

.
.
. egen tot_age_grp =  count(pop),by(age_grp)

. egen tot_age_grp_q = count(pop) if quest==1, by(age_grp)
(1232 missing values generated)

.
. egen tot_geo_grp =  count(pop),by(geo_grp)

. egen tot_geo_grp_q = count(pop) if quest==1, by(geo_grp)
(1232 missing values generated)

.
. gen weight1x = (tot_age_grp / tot_age_grp_q)
(1232 missing values generated)

.
. keep if quest==1
(1232 observations deleted)

.                         *(reducing the dataset to 3743 men)
. survwgt rake  weight1x,   ///
        by(age_grp  geo_grp) ///
        totvars(tot_age_grp tot_geo_grp) ///
        gen(weight2x)


.
. svyset  [pweight=weight2x], strata(age_grp)
pweight is weight2x
strata is age_grp

.
. svydes

pweight:  weight2x
Strata:   age_grp
PSU:      <observations>
                                      #Obs per PSU
 Strata                       ----------------------------
 age_grp    #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       346       346         1       1.0         1
       2       333       333         1       1.0         1
       3       304       304         1       1.0         1
       4       297       297         1       1.0         1
       5       284       284         1       1.0         1
       6       275       275         1       1.0         1
       7       249       249         1       1.0         1
       8       246       246         1       1.0         1
       9       231       231         1       1.0         1
      10       209       209         1       1.0         1
      11       212       212         1       1.0         1
      12       210       210         1       1.0         1
      13       184       184         1       1.0         1
      14       174       174         1       1.0         1
      15       189       189         1       1.0         1
--------  --------  --------  --------  --------  --------
      15      3743      3743         1       1.0         1

.
. svymean  age

Survey mean estimation

pweight:  weight2x                                Number of obs    =      3743
Strata:   age_grp                                 Number of strata =        15
PSU:      <observations>                          Number of PSUs   =      3743
                                                  Population size  =      4975

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
     age |   66.66605    .0067455    66.65283    66.67928    .0092211
------------------------------------------------------------------------------

. svymean  dist_study

Survey mean estimation

pweight:  weight2x                                Number of obs    =      3742
Strata:   age_grp                                 Number of strata =        15
PSU:      <observations>                          Number of PSUs   =      3742
                                                  Population size  = 4973.7235

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
dist_s~y |   25.90772    .3139459     25.2922    26.52325     1.01731
------------------------------------------------------------------------------

.
. tabstat age

    variable |      mean
-------------+----------
         age |   66.5895
------------------------

. tabstat dist_study

    variable |      mean
-------------+----------
  dist_study |  25.93867
------------------------

.
end of do-file

As one can see, the average age amongst the 4975 men is: 66.6695

Using raking and svymean, Stata estimates the average age amongst the 4975 men, based on the information from the 3743 men, to be: 66.66605

As one can see, those are quite similar.

Now let us look at the distance to travel. We raked on zip codes, which are not equivalent to distances, but despite that the results are quite amazing:

We know the average distance to travel is: 25.90153 km

After raking and basing the results on the 3743 men, Stata estimates the distance to be: 25.90772 km

Strikingly similar. The unweighted means amongst the 3743 are not as close: 66.5895 years and 25.93867 km, but really not that far off.

The differences will be far greater when raking the 600.

I will now go on.


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

