Title: Calculating variables containing weighted group summary statistics
Authors: Nicholas J. Cox, Durham University, UK; Stephen P. Jenkins, London School of Economics, UK
You have a response variable response, a weights variable weight, and a group variable group. You want a new variable containing some weighted summary statistic based on response and weight for each distinct group. However, you do not want to collapse the data, because you wish to maintain your existing data structure, and, although egen allows the calculation of many group statistics, it does not support weights. You need another solution.
Suppose that you want weighted medians. One way to get them is to loop over the distinct values of group, calculating the medians one by one. For this, we first initialize a variable:
. gen wtmedian = .
Using missing values as initial values is arbitrary. For what follows, any other numeric value would be equally satisfactory. More importantly, if you know that your results may be extremely large or extremely small, you should specify that the new variable wtmedian should be, say, double. For more information, see [D] data types.
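As a minimal sketch of that advice, the storage type can be named when the variable is created, in place of the gen command above. Whether double is actually needed depends on your data; it is used here only for illustration.

```stata
* Create the placeholder as a double rather than the default float
generate double wtmedian = .
```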
In a simple situation, the values of group could be, for example, consecutive integers. Here a loop controlled by forvalues is easiest. Below is the whole structure, which we will explain step by step.
. quietly forvalues i = 1/50 {
.         summarize response [w=weight] if group == `i', detail
.         replace wtmedian = r(p50) if group == `i'
. }
forvalues, which is usually abbreviated forval, cycles over a range of integers, here taken to be consecutive integers from 1 to 50. The loop is controlled by a so-called local macro, here named i. Within the loop, the macro is referred to as `i'. For a tutorial with much more by way of explanation and examples, see Cox (2002).
The first time around the loop, i is set to 1, and Stata summarizes response using weight for observations with group equal to 1. The manual entry for [R] summarize tells us that to calculate medians we need to specify the detail option and that the median is left behind in memory in r(p50). This result is used to overwrite the initial values of wtmedian for which group is equal to 1. The second time around the loop, we do the same for group equal to 2, and so forth. As we are repeatedly operating on subsets of a variable, we must use replace. We cannot use generate repeatedly, which is why we used generate just once, before the loop was entered. A small but important detail is that the whole loop is run quietly to avoid repeated output from each command on your monitor (and in any log files). However, if you are debugging a loop that does not seem to be correct, you may find it useful to omit the quietly while fixing the code.
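To try the loop on real data, here is a sketch that assumes the auto dataset shipped with Stata stands in for your own: price plays the part of response, weight the part of the weights, and rep78 (which takes the values 1 to 5) the part of group. These variable choices are ours, made purely for illustration.

```stata
* Assumption: auto.dta stands in for your data
*   response -> price, weight -> weight, group -> rep78 (values 1 to 5)
sysuse auto, clear
generate wtmedian = .
quietly forvalues i = 1/5 {
    summarize price [w=weight] if rep78 == `i', detail
    replace wtmedian = r(p50) if rep78 == `i'
}
* Within each group, min and max should agree, because wtmedian is constant
tabstat wtmedian, by(rep78) statistics(min max)
```

Observations with rep78 missing are never matched by the if condition, so they keep the initial missing value. Removing quietly from the forvalues line shows the output of each summarize and replace, which can help when debugging.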
Various extensions spring to mind. What about other summary statistics? You just need to add a line before the loop initializing a variable for each statistic and a line within the loop updating that variable. Suppose we also wanted lower and upper quartiles. Here is the code:
. gen wtmedian = .
. gen wtloq = .
. gen wtupq = .
. qui forvalues i = 1/50 {
.         summarize response [w=weight] if group == `i', detail
.         replace wtmedian = r(p50) if group == `i'
.         replace wtloq = r(p25) if group == `i'
.         replace wtupq = r(p75) if group == `i'
. }
. gen wtiqr = wtupq - wtloq
As a bonus, we threw in an interquartile range (iqr) calculation. A moment's thought shows that the iqr can be calculated outside the loop, as knowing the quartiles is sufficient. More generally, whatever can be moved out of the loop should be, as there is an overhead in using if. The documentation in the manual shows what is available after summarize. At worst, if the manual is not at hand, look at the results using return list after an r-class command (ereturn list after an e-class command).
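As a quick illustration of inspecting saved results, the sketch below again assumes the auto dataset, with price and weight standing in for response and weight. It lists what summarize leaves behind and then uses two of those results directly.

```stata
* Assumption: auto.dta stands in for your data
sysuse auto, clear
quietly summarize price [aw=weight], detail
* Show every r() result left behind by summarize
return list
* Saved results can be used in expressions until another r-class command runs
display "weighted IQR = " r(p75) - r(p25)
```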
We have been leaning heavily on the assumption of a simple structure for group, namely values that are consecutive integers. For more complicated structures, various strategies are available, some of which are documented for a similar problem in the FAQ: "Is there a way to tell Stata to try all values of a particular variable in a for statement without specifying them?". One that is perhaps simplest to use is levelsof. (For Stata 8 users, the corresponding command is levels, which was added 16 April 2003.) levelsof can be used to return a list of the distinct values of a categorical variable.
Let us revisit weighted medians using levelsof, assuming that group contains integers. A structure more general than forvalues is foreach, which supports cycling through the elements of any clearly defined list. foreach is discussed at length in the tutorial mentioned earlier, Cox (2002).
In this example, the list is contained within the local macro named levels, which we instruct the levelsof command to produce. We do not need to spell out the elements of the list or even to know what they are.
. gen wtmedian = .
. levelsof group, local(levels)
. qui foreach l of local levels {
.         summarize response [w=weight] if group == `l', detail
.         replace wtmedian = r(p50) if group == `l'
. }
As said, this assumes that levelsof is working on a variable group containing integers. If group were in fact a string variable, the only change needed to this code is that the tests should be if group == "`l'" rather than if group == `l'.
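The string case can be sketched as follows, again assuming the auto dataset. The variable origin_str is a hypothetical string group variable that we create from foreign just for this illustration; it takes the values "Domestic" and "Foreign".

```stata
* Assumption: auto.dta stands in for your data
sysuse auto, clear
* origin_str is a hypothetical string group variable made for this example
decode foreign, generate(origin_str)
generate wtmedian = .
levelsof origin_str, local(levels)
quietly foreach l of local levels {
    * Note the double quotes around `l' because origin_str is a string
    summarize price [w=weight] if origin_str == "`l'", detail
    replace wtmedian = r(p50) if origin_str == "`l'"
}
```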
Finally, we have specified weights using [w=weight], which applies them in whatever manner is default for the command. To apply them in some other way, you will need to be explicit about how they are to be used by specifying [aw=weight] or [fw=weight] or [iw=weight] or [pw=weight]. See [U] 11.1.6 weight for more details.
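Which weight types are accepted depends on the command. summarize allows aweights, fweights, and iweights (the last not with the detail option) but not pweights. The sketch below, which assumes the auto dataset with price and weight standing in for response and weight, shows two explicit specifications.

```stata
* Assumption: auto.dta stands in for your data
sysuse auto, clear
* Treat weight as analytic weights
summarize price [aw=weight], detail
* Treat weight as frequency weights; these must be integers, as they are here
summarize price [fw=weight], detail
```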