|
The following material is based on a question and answer that appeared on
Statalist.
How do I compute the number of distinct observations?
|
Title
|
|
Number of distinct observations
|
|
Author
|
Nicholas J. Cox, Durham University, UK
Gary Longton, Fred Hutchinson Cancer Research Center
|
|
Date
|
February 2002; revised April 2002; minor revisions October 2008
|
Observations are distinct on a variable list if they differ with respect to
that variable list. First, be aware that
codebook reports their number, albeit as “unique
values”. This command may be sufficient for your needs.
Alternatively,
contract will reduce the dataset to distinct observations and their
frequencies. Using contract destroys the existing dataset, however,
and therefore will be inefficient whenever you wish to continue using the
present dataset, which is likely in most problems.
Unique observations are also often interpreted to mean those
that occur precisely once in the data. Thus, if values of a variable are 1, 2,
2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, and 5, then in one sense of
“unique”, there are five distinct or unique values—namely,
1, 2, 3, 4, and 5—whereas in another sense there is just one unique
value—namely, 1, the only value that occurs precisely once. To put it
another way, is uniqueness a property of the input (each value occurs once in
the original) or of the output (each value occurs once in the result, because
duplicates have been set aside)?
Suppose, however, that we need to calculate the number of distinct
observations for ourselves. To do this, we need first to sort the data into
groups of distinct observations and then to count those groups.
This process requires an understanding of the use of the
by construct and the facts that under by
varlist:, _N is interpreted as the number of observations
within each distinct group defined by varlist and _n is defined
as the observation number, again within each distinct group. Several manual
sections are helpful here: [U] 11.1.2 by varlist:, [U] 11.5 by
varlist: construct, [U] 13.7 Explicit subscripting, [U] 27.2 The
by construct, and [D] by. Stata Journal readers can find a
self-contained tutorial on by (Cox 2002).
We also need an understanding that true and false conditions
evaluate numerically to 1 and 0, respectively, which is also explained in the
tutorial just cited and in the FAQ
What is true
and false in Stata?
(http://www.stata.com/support/faqs/data-management/true-and-false/).
Let us illustrate with the simplest example, computing the number of distinct
observations on a single variable using the auto dataset. First, we sort by
that variable, and then we tag the first observation within each distinct
group. Using the variable rep78,
. by rep78, sort: gen nvals = _n == 1
nvals is 1 whenever a value is first in its group and is 0 otherwise.
If this were indeed the problem, we should now just type
. count if nvals
which would return the count, also returned as r(N). To extend this
problem to others, we show a different technique. We calculate the running or
cumulative sum—and thus count the 1s, as the 0s make no
difference—and pick up the last value as our answer:
. replace nvals = sum(nvals)
. replace nvals = nvals[_N]
The variable nvals now contains the number of distinct observations.
Although we chose to tag the first observation in each group, as was
specified by _n == 1, we could have done it also for the last
observation in each group, for which the code is _n == _N. In this
problem, the result is the same.
The next complication is that we might want to define distinct
observations of a variable with respect to one or more other variables. With
one more variable (say, foreign), we might want to calculate the number
of distinct observations of rep78 separately for each value of
foreign:
. by foreign rep78, sort: gen nvals = _n == 1
. by foreign: replace nvals = sum(nvals)
. by foreign: replace nvals = nvals[_N]
If you do this for the auto data, you will find that there are 6 distinct
values of rep78 when foreign is 0 (Domestic
cars)—namely, 1, 2, 3, 4, 5 and . (missing)—and 4 distinct
values of rep78 when foreign is 1 (Foreign cars)—namely,
3, 4, 5, and . (missing). This result can also be shown directly by using
tabulate, miss.
Alternatively, we might wish to calculate the number of distinct observations
as defined by the combinations of foreign and rep78:
. by foreign rep78, sort: gen nvals = _n ==1
. replace nvals = sum(nvals)
. replace nvals = nvals[_N]
If you do this for the auto data, you will find that there are 10 distinct
values of rep78 and foreign combined, that is, every possible
pair except the pair 1 and Foreign and the pair 2 and Foreign. This naturally
follows from our previous calculation and can also be shown directly by using
tabulate, miss.
There are two further complications not tackled in detail here that are likely
to arise in practice. First, as already exemplified, computations of the kind
above include “missing”—whether ., .a,
.b, ..., .z for numeric variables or "" for string
variables—just like any other value. It is quite possible that you
do not want that, and if so, we need to exclude missing values from the
computation. Second, you might have further restrictions on the computation,
to be specified by if or in.
Stata users have written various programs in this area, including
distinct (G. Longton and N.J. Cox), the egen function
nvals() (N.J. Cox), and unique (M. Hills and T. Brady), which
tackle most or all of the wrinkles mentioned here. These are downloadable from
the SSC archive using ssc, specifically by
. ssc inst distinct
. ssc inst egenmore (N.B.)
. ssc inst unique
distinct reports on distinct values, egen, nvals() computes
their number in a new variable, and unique does some of both.
Reference
- Cox, N. J. 2002.
-
Speaking Stata: How to move step by: step.
Stata Journal 2: 86–102.
|
FAQs
What's new?
Statistics
Data management
Graphics
Programming Stata
Mata
Resources
Internet capabilities
Stata for Windows
Stata for Unix
Stata for Mac
Technical support
|