
st: stata and weighting


From   Stephen McKay <S.McKay@bristol.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   st: stata and weighting
Date   Thu, 11 Mar 2004 09:52:55 +0000

Many (perhaps most) social survey datasets come with non-integer 
weights, reflecting a mix of the sampling schema (e.g. one person per 
household randomly selected), and sometimes non-response, and sometimes 
calibration/grossing factors too.  Increasingly, in the name of 
confidentiality, data depositors are reluctant to identify too much 
about the sampling points -- thus making PSU identification not always 
possible [and hence svy approaches in stata not really practicable].

At present, stata will let you use some types of weights, some of the 
time, on some types of command, and the logic of this is hard to fathom.

I appreciate that a simple-minded application of weights will give you 
incorrect confidence intervals.  But at present stata makes it 
difficult to get the right point estimates in these circumstances. 
Here's a very simple example of what can happen, based on a simple 
indicator variable and a simple weight.

. list

     +--------------+
     | male     wgt |
     |--------------|
  1. |    0     1.5 |
  2. |    0     1.2 |
  3. |    1      .7 |
  4. |    1     1.1 |
  5. |    0      .7 |
     |--------------|
  6. |    1      .8 |
     +--------------+

. su male [w=wgt]  /// So summarize defaults to aweights.
(analytic weights assumed)

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
        male |       6  6.00000006    .4333333   .5428321          0          1

. tab1 male [w=wgt]  /// tab1 defaults to frequency weights, not allowed
(frequency weights assumed)
may not use noninteger frequency weights
r(401);

. tab1 male [iw=wgt] /// tab1 disallows iweights
iweight not allowed
r(101);

. tab1 male [aw=wgt] /// tab1 disallows aweights
aweight not allowed
r(101);

. table male [w=wgt]  /// table defaults to freq weights, too
(frequency weights assumed)
may not use noninteger frequency weights
r(401);

. table male [aw=wgt] /// aweights give you the "wrong" answers, through rounding off to integers

----------------------
     male |      Freq.
----------+-----------
        0 |          3
        1 |          3
----------------------

. table male [iw=wgt]  /// iweights give you the "right" answers

----------------------
     male |      Freq.
----------+-----------
        0 |        3.4
        1 |        2.6
----------------------

. tab male [w=wgt]
(frequency weights assumed)
may not use noninteger frequency weights
r(401);

. tab male [aw=wgt]  /// aweights with tab gives "right" answers

       male |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |3.400000002       56.67       56.67
          1 |2.599999998       43.33      100.00
------------+-----------------------------------
      Total |          6      100.00

. tab male [iw=wgt]  /// iweights with tab gives "right" answers, but with different rounding!

       male |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 | 3.40000004       56.67       56.67
          1 | 2.60000002       43.33      100.00
------------+-----------------------------------
      Total | 6.00000006      100.00

. log close

Again, I am not sure of the logic behind some of these differences, for 
what are perhaps the simplest of commands.
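
For reference, the point estimates in the log above can be checked by 
hand. What follows is a minimal Python sketch, not Stata code; the 
variable names are illustrative. The standard-deviation formula (weights 
rescaled to sum to the number of observations, divisor N - 1) is an 
assumption, chosen because it reproduces the summarize output shown.

```python
import math

# The six observations from the listing: (male, wgt)
data = [(0, 1.5), (0, 1.2), (1, 0.7), (1, 1.1), (0, 0.7), (1, 0.8)]

n = len(data)
total_w = sum(w for _, w in data)

# Weighted "frequencies": the sum of the weights within each category
weighted_freq = {}
for male, w in data:
    weighted_freq[male] = weighted_freq.get(male, 0.0) + w

# Weighted mean of the indicator variable
mean = sum(w * male for male, w in data) / total_w

# Assumed aweight-style SD: rescale weights to sum to n, divide by n - 1
scale = n / total_w
var = sum(scale * w * (male - mean) ** 2 for male, w in data) / (n - 1)
sd = math.sqrt(var)

print(weighted_freq)              # weighted counts per category
print(round(mean, 7), round(sd, 7))
```

Under these assumptions the weighted counts agree with the iweighted 
table and tab output, and the mean and standard deviation agree with the 
su output, to the precision printed.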

I doubt there is much call for an nw option (naive weight)?  But 
otherwise for some analysis one is reduced to multiplying and/or 
rounding off weights to get the point estimates that the data 
depositors/creators tell you that you should be getting (i.e. the ones 
in their report).  Such as:

gen wgt2=wgt*10
compress
. tab1 male [w=wgt2]  /// Right proportions, wrong 'bases'
(frequency weights assumed)

-> tabulation of male  

       male |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         34       56.67       56.67
          1 |         26       43.33      100.00
------------+-----------------------------------
      Total |         60      100.00
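
The table above has the right proportions but the wrong bases because 
multiplying every weight by a constant cancels out of each percentage 
while multiplying the total. A small Python sketch of that arithmetic 
follows; it is illustrative only, and the explicit rounding step is an 
assumption added here to guarantee integer weights.

```python
# The six observations from the listing: (male, wgt)
data = [(0, 1.5), (0, 1.2), (1, 0.7), (1, 1.1), (0, 0.7), (1, 0.8)]


def tabulate(rows):
    """Return {category: (weighted count, percent)} for (category, weight) rows."""
    total = sum(w for _, w in rows)
    counts = {}
    for cat, w in rows:
        counts[cat] = counts.get(cat, 0) + w
    return {cat: (c, 100 * c / total) for cat, c in counts.items()}


original = tabulate(data)
# Scale by 10 and round so every weight is an exact integer
scaled = tabulate([(cat, round(w * 10)) for cat, w in data])

for cat in sorted(scaled):
    print(cat, scaled[cat][0], round(scaled[cat][1], 2))
```

The percentages are the same before and after scaling; only the counts 
and the total are inflated by the factor of 10.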


Surely there should be something better than this?

Steve


Date: Wed, 10 Mar 2004 23:29:46 -0500
From: Richard Williams <Richard.A.Williams.5@nd.edu>
Subject: Re: st: non-integer frequencies?
At 09:49 PM 3/10/2004 -0600, ACHINTYA RAY wrote:
>Sample surveys oftentimes provide weights to convert sample estimates into
>representative population figures. Sometimes such frequency weights are not
>integers (For example, National Health and Nutrition Examination Survey
>III). It seems that Stata can only deal with integer frequency weights. Is
>there a solution? The best that I can do right now is to take the nearest
>integer to the non-integer frequencies. This method seems rather adhoc. Any
>help will be deeply appreciated.

I think iweights will work, at least if the command allows the use of 
iweights.  e.g. I just tried

. sum income

     Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
       income |       500       27.79    8.973491          5       48.3

. sum income [fw=1.2]
may not use noninteger frequency weights
r(401);

. sum income [iw=1.2]

     Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
 -------------+-----------------------------------------------------------------
       income |     500         600       27.79   8.971993          5       48.3

However, remember that, for purposes of statistical inference, the 
numbers you get are wrong.
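
One visible symptom in the output above is the standard deviation, which 
moves from 8.973491 to 8.971993 even though every observation gets the 
same weight. The two figures are consistent with the weighted sum of 
squares being divided by (sum of weights - 1) instead of (N - 1). That 
formula is inferred here from the two outputs and is an assumption, not a 
statement taken from the Stata manual. A short Python check:

```python
import math

n = 500                   # observations, from the unweighted summarize
w = 1.2                   # the constant weight used in [iw=1.2]
sd_unweighted = 8.973491  # Std. Dev. reported without weights

# Sum of squared deviations implied by the unweighted SD
ss = sd_unweighted ** 2 * (n - 1)

# Assumed iweight-style SD: weighted sum of squares over (sum of weights - 1)
sum_w = n * w
sd_iweight = math.sqrt(w * ss / (sum_w - 1))

print(round(sd_iweight, 6))
```

If the assumption holds, the weights are being treated as if they were 
600 real observations, which is why any standard errors built on them 
would be too small.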

