The following material is based on a question and answer that appeared on Statalist.
|Title||Number of distinct observations|
Nicholas J. Cox, Durham University, UK
Gary Longton, Fred Hutchinson Cancer Research Center
|Date||February 2002; revised April 2002; minor revisions October 2008|
Observations are distinct on a variable list if they differ with respect to that variable list. First, be aware that codebook reports their number, albeit as “unique values”. This command may be sufficient for your needs.
Alternatively, contract will reduce the dataset to distinct observations and their frequencies. Using contract destroys the existing dataset, however, and therefore will be inefficient whenever you wish to continue using the present dataset, which is likely in most problems.
Unique observations are also often interpreted to mean those that occur precisely once in the data. Thus, if values of a variable are 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, and 5, then in one sense of “unique”, there are five distinct or unique values—namely, 1, 2, 3, 4, and 5—whereas in another sense there is just one unique value—namely, 1, the only value that occurs precisely once. To put it another way, is uniqueness a property of the input (each value occurs once in the original) or of the output (each value occurs once in the result, because duplicates have been set aside)?
Suppose, however, that we need to calculate the number of distinct observations for ourselves. To do this, we need first to sort the data into groups of distinct observations and then to count those groups.
This process requires an understanding of the use of the by construct and the facts that under by varlist:, _N is interpreted as the number of observations within each distinct group defined by varlist and _n is defined as the observation number, again within each distinct group. Several manual sections are helpful here: [U] 11.1.2 by varlist:, [U] 11.5 by varlist: construct, [U] 13.7 Explicit subscripting, [U] 27.2 The by construct, and [D] by. Stata Journal readers can find a self-contained tutorial on by (Cox 2002). We also need an understanding that true and false conditions evaluate numerically to 1 and 0, respectively, which is also explained in the tutorial just cited and in the FAQ: What is true and false in Stata?.
Let us illustrate with the simplest example, computing the number of distinct observations on a single variable using the auto dataset. First, we sort by that variable, and then we tag the first observation within each distinct group. Using the variable rep78,
. by rep78, sort: gen nvals = _n == 1
nvals is 1 whenever a value is first in its group and is 0 otherwise. If this were indeed the problem, we should now just type
. count if nvals
which would return the count, also returned as r(N). To extend this problem to others, we show a different technique. We calculate the running or cumulative sum—and thus count the 1s, as the 0s make no difference—and pick up the last value as our answer:
. replace nvals = sum(nvals) . replace nvals = nvals[_N]
The variable nvals now contains the number of distinct observations. Although we chose to tag the first observation in each group, as was specified by _n == 1, we could have done it also for the last observation in each group, for which the code is _n == _N. In this problem, the result is the same.
The next complication is that we might want to define distinct observations of a variable with respect to one or more other variables. With one more variable (say, foreign), we might want to calculate the number of distinct observations of rep78 separately for each value of foreign:
. by foreign rep78, sort: gen nvals = _n == 1 . by foreign: replace nvals = sum(nvals) . by foreign: replace nvals = nvals[_N]
If you do this for the auto data, you will find that there are 6 distinct values of rep78 when foreign is 0 (Domestic cars)—namely, 1, 2, 3, 4, 5 and . (missing)—and 4 distinct values of rep78 when foreign is 1 (Foreign cars)—namely, 3, 4, 5, and . (missing). This result can also be shown directly by using tabulate, miss.
Alternatively, we might wish to calculate the number of distinct observations as defined by the combinations of foreign and rep78:
. by foreign rep78, sort: gen nvals = _n ==1 . replace nvals = sum(nvals) . replace nvals = nvals[_N]
If you do this for the auto data, you will find that there are 10 distinct values of rep78 and foreign combined, that is, every possible pair except the pair 1 and Foreign and the pair 2 and Foreign. This naturally follows from our previous calculation and can also be shown directly by using tabulate, miss.
There are two further complications not tackled in detail here that are likely to arise in practice. First, as already exemplified, computations of the kind above include “missing”—whether ., .a, .b, ..., .z for numeric variables or "" for string variables—just like any other value. It is quite possible that you do not want that, and if so, we need to exclude missing values from the computation. Second, you might have further restrictions on the computation, to be specified by if or in.
Stata users have written various programs in this area, including distinct (G. Longton and N.J. Cox), the egen function nvals() (N.J. Cox), and unique (M. Hills and T. Brady), which tackle most or all of the wrinkles mentioned here. These are downloadable from the SSC archive using ssc, specifically by
. ssc inst distinct . ssc inst egenmore (N.B.) . ssc inst unique
distinct reports on distinct values, egen, nvals() computes their number in a new variable, and unique does some of both.