|
The following material is based on an exchange on
Statalist.
How do I tabulate cumulative frequencies?
|
Title
|
|
Tabulating cumulative frequencies
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
February 2002; minor revisions September 2002
|
Question:
I have a variable that I can tabulate to get the frequencies and
percent and cumulative percent frequencies of its different values, but what
I really want to see are the cumulative frequencies. Any suggestions?
Answer:
The question shows that no matter how many general or special purpose
tabulation commands Stata provides, it is always possible to think of a
tabulation problem that is not directly supported.
Having said that, some general advice for a tabulation problem
is to look carefully at the various options provided by
tabulate,
tabulate, summarize,
table, and
tabstat.
You should be familiar with specialty tabulation
commands for your area, such as those discussed in
epitab,
ltable,
svy: tabulate, and
xttab.
You can use findit
to obtain information on more tabulation commands and programs, which will
include user-written material:
. findit table
One solution to this specific question, however, is to generate your own
variables for display with
tabdisp. The
purpose of the rest of this FAQ is to show that although tabdisp is
often billed as a programmer's command, which is shown by the placement of the
manual entry in the Programming Reference Manual, the command can be
used without Stata programming knowledge.
First, we need to calculate the cumulative frequencies to tabulate.
To do this, we need an understanding of the use of the
by construct and the
facts that under by varlist:, _N is interpreted
as the number of observations within each distinct group defined by
varlist and _n as the observation number, again within each
distinct group. Several manual sections are helpful here: [U] 11.1.2 by
varlist:, [U] 11.5 by varlist: construct, [U] 13.7 Explicit
subscripting, [U] 27.2 The by construct, and [D] by.
Stata Journal readers can also find a tutorial on by (Cox
2002). Let us illustrate
with a simple example from the auto dataset. First, we sort by the variable
of interest and get the frequencies:
. sysuse auto, clear
. by rep78, sort: gen freq = _N
Then we get the cumulative frequencies:
. by rep78: gen cumfreq = _N if _n == 1
. replace cumfreq = sum(cumfreq)
The trick here is to put the frequencies once more into a variable but only
once within each group. We chose to do this for the first observation in
each group; that is what is meant by _n == 1. In many other
problems, we could do it also for the last observation in each group, but
that is not good in this problem. The cumulative sum produced by the
sum() function treats all the missing values produced by the previous
command as 0, which is precisely what we want. That is, the cumulative
frequency is, as its definition requires, the cumulative sum of just one
group frequency from each group. Now we can just call up tabdisp:
. tabdisp rep78, cell(freq cumfreq)
----------------------------------
Repair |
Record |
1978 | freq cumfreq
----------+-----------------------
1 | 2 2
2 | 8 10
3 | 30 40
4 | 18 58
5 | 11 69
. | 5 74
----------------------------------
The table could be improved by adding variable labels.
This approach works fine whenever, as here, all the observations of each
category of the row variable (rep78) have the same values of the
variables shown in each cell. For example, if we looked at the values of
freq and cumfreq for each observation for which rep78
is 1, they will be identical. What tabdisp does is to show the values
of the cell variables for the first observation it finds within each
category. However, because, by construction, all the values of freq and
cumfreq are constant within each category of rep78,
tabdisp will find and show the values we want to be shown.
If this is a recurrent need, you might want to go further and wrap these
commands in a program of your own. Many complications can be accommodated,
but, just to give a sample, here is a simple program that goes a few steps
further:
program def mytab, sortpreserve
version 10
syntax varname [if] [in]
marksample touse, strok
tempvar freq cumfreq
bysort `touse' `varlist': gen long `freq' = _N
by `touse' `varlist' : gen long `cumfreq' = _N * (_n == 1)
qui by `touse' : replace `cumfreq' = sum(`cumfreq')
label var `freq' "Frequency"
label var `cumfreq' "Cumulative frequency"
tabdisp `varlist' if `touse', c(`freq' `cumfreq')
end
Among other details, this program allows use of if and in
restrictions, the columns will be more nicely labeled, and the frequencies
and cumulative frequencies will be put into temporary variables that
disappear once the program has finished.
Reference
- Cox, N. J. 2002.
-
Speaking Stata: How to move step by: step
Stata Journal 2: 86–102.
|