st: stata and weighting

From: Stephen McKay
To: statalist@hsphsun2.harvard.edu
Subject: st: stata and weighting
Date: Thu, 11 Mar 2004 09:52:55 +0000

```Many (perhaps most) social survey datasets come with non-integer
weights, reflecting a mix of the sampling schema (e.g. one person per
household randomly selected), and sometimes non-response, and sometimes
calibration/grossing factors too.  Increasingly, in the name of
confidentiality, data depositors are reluctant to identify too much
about the sampling points -- thus making PSU identification not always
possible [and hence svy approaches in stata not really practicable].

At present, stata will let you use some types of weights, some of the
time, on some types of command.  The logic behind this is hard to fathom.

I appreciate that a simple-minded application of weights will give you
incorrect confidence intervals.  But at present stata makes it
difficult to get the right point estimates in these circumstances.
Here's a very simple example of what can happen, based on a simple
indicator variable and a simple weight.

. list

     +--------------+
     | male     wgt |
     |--------------|
  1. |    0     1.5 |
  2. |    0     1.2 |
  3. |    1      .7 |
  4. |    1     1.1 |
  5. |    0      .7 |
     |--------------|
  6. |    1      .8 |
     +--------------+

. su male [w=wgt]  /// So summarize defaults to aweights.
(analytic weights assumed)

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
        male |       6  6.00000006    .4333333   .5428321          0          1

. tab1 male [w=wgt]  /// tab1 defaults to frequency weights, not allowed
(frequency weights assumed)
may not use noninteger frequency weights
r(401);

. tab1 male [iw=wgt] /// tab1 disallows iweights
iweight not allowed
r(101);

. tab1 male [aw=wgt] /// tab1 disallows aweights
aweight not allowed
r(101);

. table male [w=wgt]  /// table defaults to freq weights, too
(frequency weights assumed)
may not use noninteger frequency weights
r(401);

. table male [aw=wgt] /// aweights give you the "wrong" answers, through rounding off to integers

----------------------
male |      Freq.
----------+-----------
0 |          3
1 |          3
----------------------

. table male [iw=wgt]  /// iweights give you the "right" answers

----------------------
male |      Freq.
----------+-----------
0 |        3.4
1 |        2.6
----------------------

. tab male [w=wgt]
(frequency weights assumed)
may not use noninteger frequency weights
r(401);

. tab male [aw=wgt]  /// aweights with tab gives "right" answers

male |      Freq.     Percent        Cum.
------------+-----------------------------------
0 |3.400000002       56.67       56.67
1 |2.599999998       43.33      100.00
------------+-----------------------------------
Total |          6      100.00

. tab male [iw=wgt]  /// iweights with tab give "right" answers, but with different rounding!

male |      Freq.     Percent        Cum.
------------+-----------------------------------
0 | 3.40000004       56.67       56.67
1 | 2.60000002       43.33      100.00
------------+-----------------------------------
Total | 6.00000006      100.00

. log close

Again, I am not sure of the logic behind some of these differences, for
what are perhaps the simplest of commands.

I doubt there is much call for an nw option (naive weight)?  But
otherwise for some analysis one is reduced to multiplying and/or
rounding off weights to get the point estimates that the data
depositors/creators tell you that you should be getting (i.e. the ones
in their report).  Such as:

. gen wgt2=wgt*10
. compress
. tab1 male [w=wgt2]  /// Right proportions, wrong 'bases'
(frequency weights assumed)

-> tabulation of male

male |      Freq.     Percent        Cum.
------------+-----------------------------------
0 |         34       56.67       56.67
1 |         26       43.33      100.00
------------+-----------------------------------
Total |         60      100.00

Surely there should be something better than this?

Steve

Date: Wed, 10 Mar 2004 23:29:46 -0500
From: Richard Williams <Richard.A.Williams.5@nd.edu>
Subject: Re: st: non-integer frequencies?
At 09:49 PM 3/10/2004 -0600, ACHINTYA RAY wrote:
>Sample surveys oftentimes provide weights to convert sample estimates into
>representative population figures. Sometimes such frequency weights are not
>integers (For example, National Health and Nutrition Examination Survey
>III). It seems that Stata can only deal with integer frequency weights. Is
>there a solution? The best that I can do right now is to take the nearest
>integer to the non-integer frequencies. This method seems rather
>help will be deeply appreciated.

I think iweights will work, at least if the command allows the use of
iweights.  e.g. I just tried

. sum income

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       500       27.79    8.973491          5       48.3

. sum income [fw=1.2]
may not use noninteger frequency weights
r(401);

. sum income [iw=1.2]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      income |     500         600       27.79   8.971993          5       48.3

However, remember that, for purposes of statistical inference, the
numbers you get are wrong.

```
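The "right" point estimates discussed in the post can be checked by hand, independently of Stata. The sketch below is not part of the original thread: it is a plain-Python recomputation using the six rows from the listing, and the helper name `weighted_counts` is made up for illustration. The standard-deviation formula (weights rescaled to sum to the number of observations, divisor N - 1) is an assumption that happens to reproduce the `summarize` output quoted above; it is not taken from Stata's documentation.

```python
from math import sqrt

# Data copied from the six-row listing in the post.
male = [0, 0, 1, 1, 0, 1]
wgt = [1.5, 1.2, 0.7, 1.1, 0.7, 0.8]


def weighted_counts(values, weights):
    """Sum the weights within each category (hypothetical helper)."""
    totals = {}
    for v, w in zip(values, weights):
        totals[v] = totals.get(v, 0) + w
    return totals


# Naive weighted frequencies and percentages.
counts = weighted_counts(male, wgt)
total = sum(counts.values())
shares = {k: 100 * c / total for k, c in counts.items()}

# Weighted mean of the indicator, i.e. the weighted share of males.
weighted_mean = sum(v * w for v, w in zip(male, wgt)) / total

# Assumed aweight-style standard deviation: rescale the weights so they
# sum to N, then divide the weighted sum of squares by N - 1.
n = len(male)
norm = [w * n / total for w in wgt]
ss = sum(w * (v - weighted_mean) ** 2 for v, w in zip(male, norm))
sd_aw = sqrt(ss / (n - 1))

# The wgt2 workaround: multiply by 10 and round to integer weights.
scaled = weighted_counts(male, [round(w * 10) for w in wgt])

for k in sorted(counts):
    print(k, round(counts[k], 6), round(shares[k], 2), scaled[k])
print("weighted mean:", round(weighted_mean, 7))
print("sd (assumed aweight formula):", round(sd_aw, 7))
```

The weighted counts and percentages agree with the `tab ... [aw=wgt]` and `table ... [iw=wgt]` output, and the scaled integer weights reproduce the proportions from the `wgt2` example with bases ten times larger. As both posters note, none of this says anything about correct standard errors or confidence intervals.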