The following material is based on an exchange on Statalist.
Title | Tabulating cumulative frequencies | |
Author | Nicholas J. Cox, Durham University, UK | |
Date | February 2002; minor revisions September 2002 |
I have a variable that I can tabulate to get the frequencies and percent and cumulative percent frequencies of its different values, but what I really want to see are the cumulative frequencies. Any suggestions?
The question shows that no matter how many general or special purpose tabulation commands Stata provides, it is always possible to think of a tabulation problem that is not directly supported.
Having said that, some general advice for a tabulation problem is to look carefully at the various options provided by tabulate (oneway or twoway), tabulate, summarize, table, and tabstat. You should be familiar with specialty tabulation commands for your area, such as those discussed in epitab, ltable, svy: tabulate (oneway or twoway), and xttab.
You can use findit to obtain information on more tabulation commands and programs, which will include user-written material:
. findit table
One solution to this specific question, however, is to generate your own variables for display with tabdisp. The purpose of the rest of this FAQ is to show that although tabdisp is often billed as a programmer's command, which is shown by the placement of the manual entry in the Programming Reference Manual, the command can be used without Stata programming knowledge.
First, we need to calculate the cumulative frequencies to tabulate. To do this, we need an understanding of the use of the by construct and the facts that under by varlist:, _N is interpreted as the number of observations within each distinct group defined by varlist and _n as the observation number, again within each distinct group. Several manual sections are helpful here: [U] 11.1.2 by varlist:, [U] 11.5 by varlist: construct, [U] 13.7 Explicit subscripting, [U] 27.2 The by construct, and [D] by. Stata Journal readers can also find a tutorial on by (Cox 2002).
Let us illustrate with a simple example from the auto dataset. First, we sort by the variable of interest and get the frequencies:
. sysuse auto, clear . by rep78, sort: gen freq = _N
Then we get the cumulative frequencies:
. by rep78: gen cumfreq = _N if _n == 1 . replace cumfreq = sum(cumfreq)
The trick here is to put the frequencies once more into a variable but only once within each group. We chose to do this for the first observation in each group; that is what is meant by _n == 1. In many other problems, we could do it also for the last observation in each group, but that is not good in this problem. The cumulative sum produced by the sum() function treats all the missing values produced by the previous command as 0, which is precisely what we want. That is, the cumulative frequency is, as its definition requires, the cumulative sum of just one group frequency from each group. Now we can just call up tabdisp:
. tabdisp rep78, cell(freq cumfreq) ---------------------------------- Repair | Record | 1978 | freq cumfreq ----------+----------------------- 1 | 2 2 2 | 8 10 3 | 30 40 4 | 18 58 5 | 11 69 . | 5 74 ----------------------------------
The table could be improved by adding variable labels.
This approach works fine whenever, as here, all the observations of each category of the row variable (rep78) have the same values of the variables shown in each cell. For example, if we looked at the values of freq and cumfreq for each observation for which rep78 is 1, they will be identical. What tabdisp does is to show the values of the cell variables for the first observation it finds within each category. However, because, by construction, all the values of freq and cumfreq are constant within each category of rep78, tabdisp will find and show the values we want to be shown.
If this is a recurrent need, you might want to go further and wrap these commands in a program of your own. Many complications can be accommodated, but, just to give a sample, here is a simple program that goes a few steps further:
program def mytab, sortpreserve version 10 syntax varname [if] [in] marksample touse, strok tempvar freq cumfreq bysort `touse' `varlist': gen long `freq' = _N by `touse' `varlist' : gen long `cumfreq' = _N * (_n == 1) qui by `touse' : replace `cumfreq' = sum(`cumfreq') label var `freq' "Frequency" label var `cumfreq' "Cumulative frequency" tabdisp `varlist' if `touse', c(`freq' `cumfreq') end
Among other details, this program allows use of if and in restrictions, the columns will be more nicely labeled, and the frequencies and cumulative frequencies will be put into temporary variables that disappear once the program has finished.