
st: RE: Weights with -table- and -tabulate-


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Weights with -table- and -tabulate-
Date   Wed, 18 Dec 2002 14:27:10 -0000

Friedrich Huebler
>
> I have two questions on the use of weights with -table- and
> -tabulate-.
>
> (1) Can the frequencies be rounded when -tabulate- is used with
> weights. My weight looks like this:
>
> Variable |     Obs        Mean   Std. Dev.       Min        Max
> ---------+-----------------------------------------------------
>  sweight |   28791    1.004766    .127654    .787363   1.231606
>
> The command
>
> tab male [aw=sweight]
>
> yields this table:
>
>  Male |      Freq.     Percent        Cum.
> ------+-----------------------------------
>     0 | 14893.9698       51.73       51.73
>     1 | 13897.0302       48.27      100.00
> ------+-----------------------------------
> Total |      28791      100.00
>
> I prefer the frequencies to be shown as 14894 and 13897. Can this be
> done?
>
> (2) Why has the weight no effect on the output of -table-?
>
> When I type
>
> table male [aw=sweight]
>
> I get this table:
>
> ----------------------
>      Male |      Freq.
> ----------+-----------
>         0 |     14,862
>         1 |     13,929
> ----------------------
>
> This is the same as the unweighted frequency distribution.
>
> I looked at the manuals, the FAQs, and the list archive and found no
> answer to these questions. I use Stata 7.

Interesting!

The situation here seems to be that Friedrich can
get the results he wants from -tabulate-, but not the
format, and the format he wants from -table-,
but not the results.

I don't have Friedrich's data, so I will use the auto
data to illustrate a reply. There is perhaps
one unresolved question here which only Stata Corp
can answer definitively: why the difference in behaviour?

Here is the same contrast with the auto data:

. tab foreign [aw=mpg]

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic | 48.4098985       65.42       65.42
    Foreign | 25.5901015       34.58      100.00
------------+-----------------------------------
      Total |         74      100.00

. table foreign [aw=mpg]

----------------------
 Car type |      Freq.
----------+-----------
 Domestic |         52
  Foreign |         22
----------------------

Here's my way of thinking about it.

Suppose I say to you: count the categories of -foreign-,
given these values of -mpg- as weights.

One interpretation -- that taken by -table- -- is
that the weights are irrelevant. -table- will count for
you, but the weights don't enter into _counting_.
And in many contexts, we do want the raw frequencies,
unweighted, and also other statistics weighted by something.

This is perhaps startling, and I think it should be better
documented, but I don't think it is a bug. If
you also say: give the mean of the variable -weight-,
then -table- does pay attention to -mpg- supplied as weights.
(Incidentally, -tabstat- behaves the same way.)

There is a clear difference between

. table foreign  , c(freq mean weight)

--------------------------------------
 Car type |        Freq.  mean(weight)
----------+---------------------------
 Domestic |           52       3,317.1
  Foreign |           22       2,315.9
--------------------------------------

and

. table foreign [aw=mpg] , c(freq mean weight)

--------------------------------------
 Car type |        Freq.  mean(weight)
----------+---------------------------
 Domestic |           52       3,174.2
  Foreign |           22       2,240.6
--------------------------------------

The other interpretation -- that taken by -tabulate- --
is that you want -- as you evidently do -- a list of

    sum of weights in category / mean of weights overall

which has the property that it sums to the
total frequency.
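
As a check on this formula, here is a minimal sketch with the auto
data in memory. The local macro name meanw is my own, and I am
assuming that -summarize, meanonly- leaves the sum in r(sum) as
well as the mean in r(mean).

```stata
* mean of the weights over all observations
summarize mpg, meanonly
local meanw = r(mean)

* sum of the weights in each category, divided by the overall mean
summarize mpg if foreign == 0, meanonly
display r(sum) / `meanw'
summarize mpg if foreign == 1, meanonly
display r(sum) / `meanw'
```

The two displayed values should agree with the 48.41 and 25.59
reported by -tabulate- above. They must add up to 74, because the
category sums add up to the overall sum of the weights, which is
74 times their overall mean.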

You want to see that, but formatted in the way
you want. I don't think -tabulate- can do this.
It has no -format()- option, and it pays
no attention to variable display formats when
showing frequencies. In
addition, -tabulate- can show all sorts
of different results and it is not clear
that the same format would ever be
appropriate for all. (One answer to that would
be to allow multiple formats via more
complicated syntax.)

One remedy is to calculate directly what you want
to show and then show it with -tabdisp-. -tabdisp- is
documented at [P] tabdisp
as if it were a command only for programmers, but
it is very useful interactively as well. (-foreach-
and -forvalues- fall into the same category.)

Another otherwise awkward tabulation problem has been
shown elsewhere to yield to some calculations followed
by -tabdisp-. See the FAQ
"How do I tabulate cumulative frequencies?"
http://www.stata.com/support/faqs/data/tabdisp.html

Here is a laboured way of doing it. It has one
advantage. I may not be the only person who --
even though the manipulations here are elementary --
can get confused in this terrain unless I write
down the formulas and play with simple examples
step by step, and this route takes you where
you want to be in very easy stages. I will
go through a basic sequence and then make
some comments.

1. We want the sum of weights in each category

. egen sumw = sum(mpg) , by(foreign)

2. We want the mean of weights overall

. egen meanw = mean(mpg)

3. Our weighted frequencies are then just

. gen freq = sumw/meanw

4. By construction, these are constant within
each category, so -tabdisp- is easy

. tabdisp foreign, c(freq)

----------------------
 Car type |       freq
----------+-----------
 Domestic |    48.4099
  Foreign |    25.5901
----------------------

5. And we can control the format:

. tabdisp foreign, c(freq) format(%2.0f)

----------------------
 Car type |       freq
----------+-----------
 Domestic |         48
  Foreign |         26
----------------------
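
Carried over to Friedrich's problem, the same steps might look as
follows. This is a sketch only: I am assuming his variables are
named -male- and -sweight- as in his post, the names sumw, meanw
and freq are arbitrary, and I have not run this on his data.

```stata
* sum of weights within each category of male
egen sumw = sum(sweight), by(male)
* mean of weights over all observations
egen meanw = mean(sweight)
* weighted frequencies, constant within each category
gen freq = sumw / meanw
* show them rounded to whole numbers
tabdisp male, c(freq) format(%5.0f)
```

A width of 5 in the format leaves room for five-digit frequencies
such as those in Friedrich's table.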

Comment A
=========

We could do this via -table- but
it is not as nice:

. table foreign  , c(mean freq) format(%2.0f)

----------------------
 Car type | mean(freq)
----------+-----------
 Domestic |         48
  Foreign |         26
----------------------

or (among other possibilities)

. table foreign  , c(min freq) format(%2.0f)

----------------------
 Car type |  min(freq)
----------+-----------
 Domestic |         48
  Foreign |         26
----------------------

Comment B (especially for (budding) programmers)
=========

In other circumstances, I would be the
first to squawk at code like

. egen sumw = sum(mpg) , by(foreign)
. egen meanw = mean(mpg)
. gen freq = sumw/meanw

if within a program, as it is wasteful of
memory and slow. In a program, you
shouldn't use -egen- at all.
In any case, putting a constant
in a variable is bad style.
The code above is used to make it
as clear as possible what is
being done. For efficiency,
we could first go

. egen freq = sum(mpg), by(foreign)
. su mpg, meanonly
. replace freq = freq / r(mean)

and then get rid of the -egen-
(which makes the code longer,
but faster).
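
One way that last step might look, as a sketch rather than tested
program code. It relies on -sum()- giving a running sum under
-by:-, so that the last value in each group is the group total.

```stata
* running sum of the weights within each category
sort foreign
by foreign: gen double freq = sum(mpg)
* the last running sum in each group is the group total
by foreign: replace freq = freq[_N]
* divide by the overall mean, left behind in r(mean)
su mpg, meanonly
replace freq = freq / r(mean)
```

This assumes that -freq- does not already exist. In a program a
temporary variable would be the usual choice.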

Comment C
=========

If you were doing this a lot,
as a convenience you might like
a single function to calculate
the weighted frequencies. This
doesn't seem to have been done,
so I have written an -egen-
function -wtfreq()- which will
be added to -egenmore- on SSC.

Nick
n.j.cox@durham.ac.uk

