Re: st: How to set calibrated weights


From   Veronica Galassi <veronicagalassi@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: How to set calibrated weights
Date   Thu, 25 Oct 2012 14:33:59 +0100

Dear Steve,

Many thanks for your email!

I have checked and in the first wave the variable w1_hhcluster is the PSU.
In fact, by using the statement: "svyset w1_hhcluster[pw= w1_wgt],
strata( w1_hhdc)" the PSUs sum up to 400, as expected.

      pweight: w1_wgt
          VCE: linearized
  Single unit: missing
     Strata 1: w1_hhdc
         SU 1: w1_hhcluster
        FPC 1: <zero>

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max
--------  --------  --------  --------  --------  --------
       1         7       303        28      43.3        74
       2        10       436        27      43.6        58
       3        12       445        22      37.1        60
       4         7       310        25      44.3        78
       5         6       303        37      50.5        56
       6         6       318        45      53.0        73
       7         7       212         6      30.3        51
       8         7       297        24      42.4        53
       9         6       289        29      48.2        73
      10         8       229        13      28.6        46
      12         8       210        14      26.2        45
      13         9       348         6      38.7        53
      14         6       281        33      46.8        66
      15         7       424        46      60.6        74
      16         7       226         5      32.3        48
      17         6       222        23      37.0        48
      18         7       297         8      42.4        67
      19         5       176        28      35.2        52
      20         6       160         2      26.7        54
      21         9       564        28      62.7        85
      22         9       397        24      44.1        69
      23         8       465        42      58.1        98
      24         7       328        12      46.9        69
      25         8       360        32      45.0        66
      26         7       397        26      56.7        82
      27         5       222        34      44.4        55
      28         7       365        30      52.1        70
      29         9       368         2      40.9        67
      30        11       471         5      42.8        57
      31         9       324        24      36.0        52
      32         8       330        13      41.2        73
      33         7       246        17      35.1        47
      34         7       223         7      31.9        62
      35         6       275        33      45.8        77
      36         9       393         5      43.7        83
      37         7       256         8      36.6        67
      38         8       379        37      47.4        58
      39         6       348        37      58.0        75
      40         7       167         1      23.9        58
      42         8       345        31      43.1        53
      43         7       294        32      42.0        54
      44         5       250        38      50.0        57
      76         9       353        27      39.2        61
      81         5       218        21      43.6        55
      82         7       227        24      32.4        39
      83         7       354        11      50.6        65
      84         4       155        25      38.8        51
      88         6       187         4      31.2        60
     171        10       443        26      44.3        67
     275        10       404         7      40.4        64
     572        10       525        16      52.5       114
     773         9       314        12      34.9        52
     774        12       451        16      37.6        55
--------  --------  --------  --------  --------  --------
      53       400     16884         1      42.2       114


Thanks to you I also managed to find the "cluster" variable in the
second wave. Since data are spread among 8 datasets, I did not find it
before!
However, when running the statement "svyset cluster[pw= w2_wgt],
strata( w2_gc_dc )"
the PSUs no longer sum to 400, as you can see below. Is there
any explanation for that?
Does it mean that the "cluster" variable is not the PSU for the second
wave? It must be; otherwise they would not have labelled it "Original
wave 1 sample cluster".

Survey: Describing stage 1 sampling units

      pweight: w2_wgt
          VCE: linearized
  Single unit: missing
     Strata 1: w2_gc_dc
         SU 1: cluster
        FPC 1: <zero>

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max
--------  --------  --------  --------  --------  --------
       1         8       234         3      29.2        53
       2        15       469         1      31.3        67
       3        14       363         2      25.9        58
       4        11       214         1      19.5        56
       5         6       280        21      46.7        68
       6         8       307         1      38.4        63
       7         8       183         2      22.9        46
       8         9       315         1      35.0        67
       9        11       302         1      27.5        73
      10        13       210         1      16.2        39
      12        20       204         1      10.2        34
      13        15       431         1      28.7        64
      14        11       296         1      26.9        64
      15        13       425         1      32.7        76
      16         8       209         1      26.1        50
      17        11       222         1      20.2        46
      18        10       265         1      26.5        57
      19         7       153         1      21.9        38
      20        11       173         1      15.7        55
      21        10       638         1      63.8        99
      22        16       455         1      28.4        88
      23        12       651         1      54.2       121
      24        10       443         2      44.3       112
      25        12       573         1      47.8       108
      26         7       405        11      57.9       102
      27         9       206         1      22.9        51
      28        11       478         1      43.5        89
      29        11       388         1      35.3        73
      30        18       511         1      28.4        64
      31        24       375         1      15.6        54
      32        11       359         1      32.6        87
      33        13       328         1      25.2        69
      34        15       245         1      16.3        71
      35        18       317         1      17.6        82
      36        13       440         2      33.8        93
      37        23       278         1      12.1        64
      38        16       442         1      27.6        76
      39        10       376         1      37.6        76
      40        18       154         1       8.6        45
      42        14       347         1      24.8        62
      43        10       400         1      40.0        77
      44         8       236         2      29.5        58
      76        40       374         1       9.3        75
      81        11       237         1      21.5        58
      82         9       187         2      20.8        45
      83        14       384         1      27.4        72
      84         5       205         1      41.0        69
      88        33       233         1       7.1        64
     171        30       474         1      15.8        67
     275        13       403         1      31.0        56
     572        31       665         1      21.5       112
     773        30       285         1       9.5        39
     774        59       505         1       8.6        60
--------  --------  --------  --------  --------  --------
      53       793     18252         1      23.0       121

                       18285 = #Obs with missing values in the
                    --------   survey characteristics
                       36537
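
One possible check, offered as a sketch rather than a diagnosis: -svydes- reports sampling units within each stratum, so a cluster whose observations fall in more than one stratum would be counted once per stratum. The toy data below use the hypothetical names cluster and stratum as stand-ins for "cluster" and "w2_gc_dc". The code counts the distinct clusters and flags any that span several strata. Whether this accounts for the 793 units above is an assumption that only the NIDS files can confirm.

```stata
* Toy data (hypothetical): cluster 1 has observations in strata 10 and 11
clear
input cluster stratum
1 10
1 10
1 11
2 10
2 10
3 12
end

* Tag one observation per distinct cluster, and one per cluster-stratum pair
egen byte tag_cl = tag(cluster)
egen byte tag_cs = tag(cluster stratum)

* Number of distinct strata in which each cluster appears
bysort cluster: egen n_strata = total(tag_cs)

* Number of distinct clusters, ignoring strata
count if tag_cl

* Distribution of strata-per-cluster, one row per cluster
tabulate n_strata if tag_cl
```

If the count of distinct clusters is 400 but some clusters show n_strata greater than 1, the stratum variable rather than the cluster variable would be the place to look.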



2012/10/24 Steve Samuels <sjsamuels@gmail.com>:
> I looked at the documentation some more; the corresponding variable in
> Wave 2 is "cluster". You could have discovered this for yourself by
> typing "lookfor cluster", which would identify any variable whose name
> or label contained "cluster".
>
> But I am not sure that this _is_ the PSU, though used in the (incorrect)
> published Stata example. According to: Methodology: Report on NIDS Wave
> 1 Technical Paper no. 1, page 9, quoted below, "cluster" is the name of
> the *second* stage sampling unit, not the *primary* sampling unit. Double
> check with your contact.
>
>
> Quote from Methodology Report, page 8
>
> 3.2 Sample of dwelling units At the time that the Master Sample was
> compiled, 8 non-overlapping samples of dwelling units were
> systematically drawn within each PSU. Each of these samples is called a
>  “cluster” by Stats SA. These clusters were then allocated to the various
>  household surveys that were conducted by Stats SA between 2004 and 2007.
> However, two clusters in each PSU were never used by Stats SA and these were
>  allocated to NIDS.
>
> Steve
>
>>
>>
>>
>> On October 23, Veronica Galassi wrote:
>>
>>
>> As you correctly said, looking at the wave 1 it is possible to
>> understand that the PSU variable is "w1_hhcluster".
>> However, this variable is missing in wave 2 so I contacted the person
>> responsible for the data management of the survey and they should
>> provide me with this variable soon!
>>
>> Many thanks again for your support, dear Steve, and for the passion
>> you put into helping people in trouble with Stata!
>>
>> All the best,
>>
>> Veronica
>>
>>
>> Veronica:
>>
>> "Introduction to Wave 1 Data May 2012"
>>
>> I looked through the NIDS web site information for Wave 2 and finally resorted to a Google search for "NIDS svyset",
>> and got a hit to "Introduction to Wave 1 Data May 2012" at: http://www.nids.uct.ac.za/home/index.php?/Nids-Documentation/documents.html
>>
>> There is the statement :
>> "In Stata the recommended svyset command is svyset [pw= w1_wgt], strata(w1_hhdc) psu( w1_hhcluster)."
>>
>> This is incorrect syntax.  The proper syntax for -svyset- would be
>
> *************************************************
> svyset w1_hhcluster [pw= w1_wgt], strata(w1_hhdc)
> ************************************************
>
> Now, you have to find the equivalent w2_ variables.  The clue to the PSU is that it takes on 400 unique values. It might be w2_hhcluster, but it could be w2_hhgeo, which you picked out as a cluster variable.
>
> So the correct statement is likely to be either:
> *************************************************
> svyset w2_hhcluster [pw= w2_wgt], strata(w2_hhdc)
> svydes
> ************************************************
> OR
> *************************************************
> svyset  w2_hhgeo [pw= w2_wgt], strata(w2_hhdc)
> svydes
> ************************************************
>
> The one with #Units = 400 distinct values is correct. If both show 400 units, see which one reproduces Table 2 of the "Introduction to Wave 1 Data May 2012". The following code can help do this:
> *************************************
> egen t_geo = tag(w2_hhgeo)
> egen t_cluster = tag(w2_hhcluster)
> tab w2_gc_prov if t_geo
> tab w2_gc_prov if t_cluster
> *************************************
>
> So it is up to you to do the detective work and to study survey design.
> Good luck.
>
> Steve
>
>
>
>
>
> On Oct 21, 2012, at 5:05 AM, Veronica Galassi wrote:
>
> Dear Steve,
>
> Thank you very much for your time.
>
> This is the quote from the document describing the sampling
> methodology (Methodology: Report on NiDS Wave 1, page 9). This
> technical document and the one explaining how weights have been built
> can be found here:
> http://www.nids.uct.ac.za/home/index.php?/Nids-Documentation/technical-papers.html.
> "A stratified, two-stage cluster design was employed to be included in
> the base wave. In the first stage, 400 PSUs were included from Stats
> SA's 2003 Master Sample of 3,000 PSUs...A PSU is defined as a
> geographical area that consists of at least one Enumeration Area (EA)
> or several EAs from the 2001 census...In some cases it has been
> necessary to add EAs to the original EA to meet the requirement of a
> minimum of 74 households per PSU."
> I tried to contact the organisation responsible for the survey asking
> for more info regarding the PSU but they did not come back to me. The
> reason why I called the clusters "cluster 1" and "cluster 2" is just
> to distinguish them from each other. In the above-mentioned document
> there is no clear reference to province and geographical type being
> cluster 1 and 2. Looking at the variables in the dataset and reading
> the documents, I deduced they were the two clusters in question.
>
> This is what I typed when I tried not to specify the PSU:
> "svyset [pw=w2_wgt], strata ( w2_gc_dc)|| w2_hhgeo|| w2_gc_prov"
> And this is the error I got back (r198):"invalid use of _n;
> observations can only be sampled in the final stage".
>
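
A sketch of what the error message seems to be pointing at, using made-up data: when no PSU variable is named, the observations themselves (_n) have to be declared as the sampling units, and no later stage can follow them. Declaring _n gives the same weighted point estimate as declaring a cluster variable as the PSU; only the standard errors differ. The names stratum, psu, w and y are hypothetical and do not come from the NIDS data.

```stata
* Toy data (hypothetical): 2 strata, 2 clusters per stratum, 2 observations each
clear
input stratum psu w y
1 1 2 10
1 1 2 12
1 2 3 20
1 2 3 22
2 3 1 30
2 3 1 32
2 4 4 40
2 4 4 42
end

* Observations as the only sampling stage: _n takes the PSU slot
svyset _n [pw = w], strata(stratum)
svy: mean y
matrix b_obs = e(b)

* Clusters as the PSUs
svyset psu [pw = w], strata(stratum)
svy: mean y
matrix b_psu = e(b)

* The point estimates agree, so this difference is zero
display b_obs[1,1] - b_psu[1,1]
```

Comparing the two -svy: mean- outputs shows that the choice of PSU changes the standard error but not the estimated mean.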
> Yes, I tried to set the weights following the statement: "svyset w2_gc_prov
> [pw = w2_wgt], strata(w2_gc_dc) || w2_hhgeo" followed by svydes.
> This is the output:
>
>                                  #Obs per Unit
>                             ----------------------------
> Stratum    #Units     #Obs      min       mean      max
> --------  --------  --------  --------  --------  --------
>      1         1*      234       234     234.0       234
>      2         1*      469       469     469.0       469
>      3         1*      363       363     363.0       363
>      4         1*      214       214     214.0       214
>      5         1*      280       280     280.0       280
>      6         1*      307       307     307.0       307
>      7         1*      183       183     183.0       183
>      8         1*      315       315     315.0       315
>      9         1*      302       302     302.0       302
>     10         1*      210       210     210.0       210
>     12         1*      204       204     204.0       204
>     13         1*      431       431     431.0       431
>     14         1*      296       296     296.0       296
>     15         1*      425       425     425.0       425
>     16         1*      209       209     209.0       209
>     17         1*      222       222     222.0       222
>     18         1*      265       265     265.0       265
>     19         1*      153       153     153.0       153
>     20         1*      173       173     173.0       173
>     21         1*      638       638     638.0       638
>     22         1*      455       455     455.0       455
>     23         1*      651       651     651.0       651
>     24         1*      443       443     443.0       443
>     25         1*      573       573     573.0       573
>     26         1*      405       405     405.0       405
>     27         1*      206       206     206.0       206
>     28         1*      478       478     478.0       478
>     29         1*      388       388     388.0       388
>     30         1*      511       511     511.0       511
>     31         1*      375       375     375.0       375
>     32         1*      359       359     359.0       359
>     33         1*      328       328     328.0       328
>     34         1*      245       245     245.0       245
>     35         1*      317       317     317.0       317
>     36         1*      440       440     440.0       440
>     37         1*      278       278     278.0       278
>     38         1*      442       442     442.0       442
>     39         1*      376       376     376.0       376
>     40         1*      154       154     154.0       154
>     42         1*      347       347     347.0       347
>     43         1*      400       400     400.0       400
>     44         1*      236       236     236.0       236
>     76         2       374       124     187.0       250
>     81         2       237        50     118.5       187
>     82         2       187         3      93.5       184
>     83         2       384        73     192.0       311
>     84         2       205         2     102.5       203
>     88         2       233        14     116.5       219
>    171         1*      474       474     474.0       474
>    275         1*      403       403     403.0       403
>    572         1*      665       665     665.0       665
>    773         1*      285       285     285.0       285
>    774         1*      505       505     505.0       505
> --------  --------  --------  --------  --------  --------
>     53        59     18252         2     309.4       665
>
>                       3703 = #Obs with missing values in the
>                   --------   survey characteristics
>                      21955
>
>
> After having set the weights in this way, I tried to conduct some
> descriptive statistics by typing:"svy: mean (tot_grem_k) if
> tot_grem_k>0 & w2_a_cgprv1!=10"
> I got back the mean but the standard errors were missing. In fact,
> Stata gave me back the following note:"Note: missing standard error
> because of stratum with single sampling unit.",as it is clearly shown
> in the table above.
>
> I hope this clarifies the sampling methodology a bit.
> Thank you so much for your precious help, I am learning a lot from
> your comments!!!
>
> Kind regards,
>
> Veronica
>
>
>
>
> 2012/10/20 Steve Samuels <sjsamuels@gmail.com>:
>>>
>>> On Oct 20, 2012, at 5:08 AM, Veronica Galassi wrote:
>>>
>>> Dear Steve,
>>>
>>> Thank you very much for your kind reply and the useful references!
>>> Your answer actually clarified many other doubts I had.
>>>
>>> Your intuition that my post-stratified weights are calibrated is
>>> correct. Unfortunately, I checked again the documents explaining the
>>> sampling methodology and there the PSU is simply defined as a
>>> geographic area containing more than 74 dwellings. Therefore I expect
>>> the number of PSU to be high (around 3,000) whereas I only have 9
>>> provinces and 4 geographical types in my survey. This implies that
>>> none of my cluster variables can be the PSU.
>>
>> You still haven't persuaded me. I'd have to see the quote from the study
>> documents. Or, better, post a link to them if they are online. You'd
>> better figure out what role, if any, the cluster variables have in the
>> design. Why did you name them "cluster 1" and "cluster 2"?
>>> However, if I got your point, it does not really matter which PSU I
>>> indicate when conducting descriptive statistics. Is it correct?
>>
>> No, it is not. It is scientifically irresponsible to publish estimates
>> of descriptive statistics without indications of uncertainty (SEs, CIs).
>>
>>> For
>>> this reason, I also tried not to indicate any PSU but Stata gave me
>>> back the error: "invalid use of _n; observations can only be sampled
>>> in the final stage".
>> See FAQ Section 3.3, first sentence.
>>
>>> To cut it short, do you still believe I can use the statement "svyset
>>> w2_gc_prov [pw = w2_wgt], strata(w2_gc_dc) || w2_hhgeo" you previously
>>> indicated to set my calibrated weights? (In my case I cannot use the
>>> fpc option).
>>
>> I don't know, because you have not yet correctly described the sampling
>> design. As an aside, have you even tried the statement, which assumed
>> that w2_gc_prov is the PSU? When you do, follow it by -svydes-.
>>
>>>
>> 2012/10/20 Steve Samuels <sjsamuels@gmail.com>:
>>> Veronica,
>>>
>>> The PSU variable is not missing. It is the sampling unit at the first
>>> stage of sampling and it's one of your cluster variables, probably
>>> "cluster 1" (check). Your statement that one must know the PSU variable
>>> to use probability weights is also incorrect. One can get proper
>>> weighted estimates, though not standard errors, without knowing the PSU.
>>>
>>> I'm not sure what's wrong with your -concat- statement. I would have
>>> used "egen combination = group()". For it to have worked, the value of
>>> the "post-stratification weight" would have to be the population count
>>> for each combination of the three variables.
>>>
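
A minimal sketch of the egen group() approach on made-up data. The names age_int, race and gender are hypothetical stand-ins for w2_age_intervals, w2_best_race and w2_best_gen. group() returns a numeric variable with one integer code per observed combination of the three variables, so the resulting cells are mutually exclusive by construction, whereas concat() returns a string.

```stata
* Toy data (hypothetical): three categorical variables, five observations
clear
input age_int race gender
1 1 1
1 1 2
2 1 1
2 3 2
1 1 1
end

* One integer code per distinct (age_int, race, gender) combination
egen combination = group(age_int race gender)

* Inspect the cells and the variables that define them
tabulate combination
list age_int race gender combination, sepby(combination)
```

The resulting variable could then be supplied to poststrata(), with postweight() holding the population count for each cell, as described in the message above.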
>>> If the "post-stratification" weights are not integers, they are probably
>>> "calibration" weights that have already adjusted the probability
>>> weights. In that case, further post-stratification is likely to be
>>> superfluous. You would then use the "post-stratification weight" in place of
>>> the probability weights. All weights should be
>>> described in the study documents (though usually not the "codebook"). If
>>> they are not, then contact the organization that did the study for
>>> details.
>>>
>>> If sampling was without replacement at one or more stages,
>>> you could use the fpc() option for those stages. In practice,
>>> it makes a difference only for the first stage.
>>>
>>> In any case, one guess at a -svyset- statement (assuming the
>>> "post-stratification weight" is a "calibration" weight) is:
>>> *************************************************************
>>> svyset w2_gc_prov [pw = w2_wgt], strata(w2_gc_dc) || w2_hhgeo
>>> **************************************************************
>>>
>>> But I could be wrong, depending on how w2_wgt was calculated.
>>>
>>> Before proceeding, I suggest that you learn more about sampling or take
>>> a survey course. I gave some references in:
>>> http://www.stata.com/statalist/archive/2012-09/msg01058.html.
>>> The Stata survey manual is also a very good resource, though the section on
>>> post-stratification is skimpy.
>>>
>>> Steve
>>>
>>>
>>> On Oct 19, 2012, at 1:57 PM, Veronica Galassi wrote:
>>>
>>> Dear Statalisters,
>>>
>>> I am writing you concerning the application of calibrated weights to
>>> my dataset for the computation of descriptive statistics only.
>>>
>>> The dataset I am working on collects information at household and
>>> individual level and comes from a stratified, two-stage clustered
>>> sample. The following are the variables I have got:
>>> - probability weights: w2_dwgt
>>> - strata: w2_gc_dc
>>> - cluster 1: w2_gc_prov
>>> - cluster 2: w2_hhgeo
>>> - post-stratified weights: w2_wgt
>>> - age intervals:  w2_age_intervals
>>> - gender: w2_best_gen
>>> - population group: w2_best_race
>>>
>>> In order to set the probability weights using the command svyset, I
>>> need the psu variable. As you may have noticed, this variable is
>>> missing and this makes it impossible for me to set pweights.
>>> In addition, from a couple of previous statalist conversations ( see
>>> in particular: http://www.ats.ucla.edu/stat/stata/faq/svy_stata_post.htm
>>> and http://www.stata.com/statalist/archive/2012-02/msg00584.html), I
>>> understood that:
>>> - when using calibrated weights I still have to set pweights and
>>> specify the original strata and clusters
>>> - In order to apply calibrated weights I need to know the characteristics
>>> on the basis of which the sample has been post-stratified (in my case
>>> age intervals, gender and population groups).
>>>
>>> Therefore, I tried to set my post-stratified weights using the
>>> following command:
>>> "svyset [pw=w2_dwgt], strata (w2_gc_dc) poststrata (w2_age_intervals
>>> w2_best_gen w2_best_race) postweight(w2_wgt)"
>>> which did not work because in Stata the poststrata must be mutually
>>> exclusive and thus only one variable can be specified.
>>>
>>> In order to overcome this problem, I tried to generate a variable
>>> which is a combination of the three characteristics by using the
>>> command
>>> "egen combination=concat( w2_age_intervals w2_best_race w2_best_gen),
>>> format (float)".
>>> However, this command generated a variable containing only missing
>>> values and for this reason Stata gave me back the error:
>>> "option postweight() requires option poststrata()".
>>> The only way to make Stata set the post-calibrated weight was by using
>>> the command
>>> "svyset, poststrata (combination) postweight(w2_wgt)" with combination
>>> being a string variable. However I am scared that this command is not
>>> complete.
>>>
>>> At this point, I would really appreciate any hint on what I am doing
>>> wrong and how to proceed to set my post-stratified weights.
>>>
>>> Many thanks for your help!
>>>
>>> Kind regards,
>>>
>>> Veronica Galassi
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/

