The following material grew out of postings to
Statalist.
I want to calculate a variable containing weighted group summary
statistics, but I do not want to collapse the data and egen does not support
weights. How can I do this?
Title: Calculating variables containing weighted group summary statistics
Author: Nicholas J. Cox, Durham University, UK
        Stephen P. Jenkins, University of Essex, UK
Date: January 2003; minor revisions April 2003, August 2005
1. The problem
You have a response variable response, a weights variable
weight, and a group variable group. You want a new variable
containing some weighted summary statistic based on response and
weight for each distinct group. However, you do not want to
collapse the
data, because you wish to maintain your existing data structure, and, although
egen allows the
calculation of many group statistics, it does not support weights. You need
another solution.
2. An example solution
Suppose that you want weighted medians. One way to get them is to loop over
the distinct values of group, calculating the medians one by one.
For this, we first initialize a variable:
. gen wtmedian = .
Using missing values as initial values is arbitrary. For what follows, any
other numeric value would be equally satisfactory. More importantly, if
you know that your results may be extremely large or extremely small, you
should specify that the new variable wtmedian should be, say,
double. For more information, see help
data types.
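For instance, the initialization could request double precision explicitly. This is only a minimal sketch using the variable name from above; whether double is really needed depends on your data.

```stata
* Initialize the result variable as double rather than the default float
generate double wtmedian = .
```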
In a simple situation, the values of group could be, for example,
consecutive integers. Here a loop controlled by
forvalues is
easiest. Below is the whole structure, which we will explain step by step.
. quietly forvalues i = 1/50 {
.         summarize response [w=weight] if group == `i', detail
.         replace wtmedian = r(p50) if group == `i'
. }
forvalues, which is usually abbreviated forval, cycles over a
range of integers, here taken to be consecutive integers from 1 to 50. The
loop is controlled by a so-called local macro, here named i. Within
the loop, the macro is referred to as `i'. For a tutorial with much
more by way of explanation and examples, see Cox (2002).
The first time around the loop, i is set to 1, and Stata
summarizes the response using weight for observations
with values of group equal to 1. The manual entry for [R]
summarize tells us that to calculate medians we need to specify the
detail option and that the median is left behind in memory in
r(p50). This result is used to overwrite the initial values of
wtmedian for which group is equal to 1. The second time around the
loop, we do the same for group equal to 2, and so forth. As we are repeatedly
operating on subsets of a variable, we must use replace. We cannot
use generate repeatedly, which is why we used generate just
once, before the loop was entered. A small but important detail is that the
whole is done
quietly to avoid
repeated output from each command on your monitor (and in any log files).
However, if you are debugging a loop that does not seem to be correct, you
may find it useful to omit the quietly while fixing the code.
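To see the loop run end to end, here is a sketch using the auto dataset shipped with Stata. The choice of mpg as the response, weight as the weights variable, and foreign (coded 0 and 1) as the group variable is ours, purely for illustration, and the name wtmed_mpg is hypothetical. The weights are declared as analytic weights here only to avoid the note that Stata prints when the weight type is left unspecified.

```stata
* Illustrative data: any dataset with a response, a weight, and a group would do
sysuse auto, clear

* Initialize once, before the loop
generate wtmed_mpg = .

* foreign takes the values 0 and 1, so forvalues can cycle over them
quietly forvalues i = 0/1 {
        summarize mpg [aw=weight] if foreign == `i', detail
        replace wtmed_mpg = r(p50) if foreign == `i'
}

* Check: the new variable should be constant within each group
tabulate foreign, summarize(wtmed_mpg)
```

In the final table, a standard deviation of zero within each group is a quick sign that every observation in the group received the same value.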
3. Extensions
Various extensions spring to mind. What about other summary statistics? You
just need to add a line before the loop initializing a variable for each
statistic and a line within the loop updating that variable. Suppose we
also wanted lower and upper quartiles. Here is the code:
. gen wtmedian = .
. gen wtloq = .
. gen wtupq = .
. qui forvalues i = 1/50 {
.         summarize response [w=weight] if group == `i', detail
.         replace wtmedian = r(p50) if group == `i'
.         replace wtloq = r(p25) if group == `i'
.         replace wtupq = r(p75) if group == `i'
. }
. gen wtiqr = wtupq - wtloq
As a bonus, we threw in an interquartile range (iqr) calculation. A moment's
thought shows that the iqr can be calculated outside the loop, as knowing
the quartiles is sufficient. More generally, whatever can be moved out of
the loop should be, as there is an overhead in using if. The
documentation in the manual shows what is available after summarize.
At worst, if the manual is not at hand, look at the results using return
list after an r-class command (ereturn list after an e-class
command).
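As a small sketch of that last suggestion, using the variable names from the problem statement, you could run summarize once on the whole dataset and inspect what it leaves behind before writing the loop:

```stata
* Run summarize once, with the detail option, on all observations
summarize response [w=weight], detail

* List every result that summarize left behind in r()
return list

* Any listed result can then be used in later commands, for example
display r(p50)
```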
We have been leaning heavily on the assumption of a simple structure for
group, namely, consecutive integer values. For more complicated
structures, various strategies are available, some of which are documented
for a similar problem in the FAQ: "Is there a way to tell Stata to try all
values of a particular variable in a for statement without specifying them?"
at
http://www.stata.com/support/faqs/data-management/try-all-values-with-foreach/. One that is perhaps
simplest to use is
levelsof. (For Stata 8 users, the corresponding command is
levels, which was added 16 April 2003.) levelsof can be used
to return a list of the distinct values of a categorical variable.
Let us revisit weighted medians using levelsof, assuming that
group contains integers. A structure more general than forvalues is
foreach, which
supports cycling through the elements of any clearly defined list.
foreach is discussed at length in the tutorial mentioned earlier in
Cox (2002).
In this example, the list is contained within the local macro named
levels, which we instruct the levelsof command to produce. We
do not need to spell out the elements of the list or even to know what they
are.
. gen wtmedian = .
. levelsof group, local(levels)
. qui foreach l of local levels {
.         summarize response [w=weight] if group == `l', detail
.         replace wtmedian = r(p50) if group == `l'
. }
As noted, this assumes that levelsof is working on a variable
group containing integers. If group were instead a string variable, the
only change to this code would be that the tests should be if group ==
"`l'" rather than if group == `l'.
Finally, we have specified weights using [w=weight], which applies
them in whatever manner is default for the command. To apply them in some
other way, you will need to be explicit about how they are to be used by
specifying [aw=weight] or [fw=weight] or [iw=weight] or
[pw=weight].
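As a brief sketch of being explicit, the two calls below state the weight type rather than relying on the default. The variable names are those of the problem statement; the second call additionally assumes that weight holds integer counts, which frequency weights require.

```stata
* Analytic weights: weight is treated as a relative weight
summarize response [aw=weight], detail

* Frequency weights: assumes weight contains integer counts of
* identical observations; Stata will refuse noninteger values
summarize response [fw=weight], detail
```

Not every command accepts every weight type. To the best of our knowledge, summarize allows aweights, fweights, and iweights (the last not together with detail) but not pweights, so check the help for whichever command you place inside the loop.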
Reference
- Cox, N. J. 2002. Speaking Stata: How to face lists with fortitude.
  Stata Journal 2: 202–222.