The following material is based on a question and answer that appeared on
Statalist.

Title | Number of distinct observations | |

Author |
Nicholas J. Cox, Durham University, UK Gary Longton, Fred Hutchinson Cancer Research Center |

Observations are distinct on a variable list if they differ with respect to
that variable list. First, be aware that
**codebook** reports their number, albeit as “unique
values”. This command may be sufficient for your needs.

Alternatively,
**contract** will reduce the dataset to distinct observations and their
frequencies. Using **contract** destroys the existing dataset, however,
and therefore will be inefficient whenever you wish to continue using the
present dataset, which is likely in most problems.

Unique observations are also often interpreted to mean those that occur precisely once in the data. Thus, if values of a variable are 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, and 5, then in one sense of “unique”, there are five distinct or unique values—namely, 1, 2, 3, 4, and 5—whereas in another sense there is just one unique value—namely, 1, the only value that occurs precisely once. To put it another way, is uniqueness a property of the input (each value occurs once in the original) or of the output (each value occurs once in the result, because duplicates have been set aside)?

Suppose, however, that we need to calculate the number of distinct observations for ourselves. To do this, we need first to sort the data into groups of distinct observations and then to count those groups.

This process requires an understanding of the use of the
**by** construct and the facts that under **by**
*varlist***:**, **_N** is interpreted as the number of observations
within each distinct group defined by *varlist* and **_n** is defined
as the observation number, again within each distinct group. Several manual
sections are helpful here: [U] **11.1.2 by varlist:**, [U] **11.5 by
varlist: construct**, [U] **13.7 Explicit subscripting**, [U] **27.2 The
by construct**, and [D] **by**. *Stata Journal* readers can find a
self-contained tutorial on **by** (Cox 2002).
We also need an understanding that true and false conditions
evaluate numerically to 1 and 0, respectively, which is also explained in the
tutorial just cited and in the
*FAQ: What is true
and false in Stata?*.

Let us illustrate with the simplest example, computing the number of distinct
observations on a single variable, **rep78**, in the auto dataset. First, we
read in that dataset, sort by
that variable, and then we tag the first observation within each distinct
group.

. sysuse auto, clear . by rep78, sort: gen nvals = _n == 1

**nvals** is 1 whenever a value is first in its group and is 0 otherwise.
If this were indeed the problem, we should now just type

. count if nvals

which would return the count, also returned as **r(N)**. To extend this
problem to others, we show a different technique. We calculate the running or
cumulative sum—and thus count the 1s, as the 0s make no
difference—and pick up the last value as our answer:

. replace nvals = sum(nvals) . replace nvals = nvals[_N]

The variable **nvals** now contains the number of distinct observations.
Although we chose to tag the first observation in each group, as was
specified by **_n == 1**, we could have done it also for the last
observation in each group, for which the code is **_n == _N**. In this
problem, the result is the same.

The next complication is that we might want to define distinct
observations of a variable with respect to one or more other variables. With
one more variable (say, **foreign**), we might want to calculate the number
of distinct observations of **rep78** separately for each value of
**foreign**:

. by foreign rep78, sort: gen nvals = _n == 1 . by foreign: replace nvals = sum(nvals) . by foreign: replace nvals = nvals[_N]

If you do this for the auto data, you will find that there are 6 distinct
values of **rep78** when **foreign** is 0 (Domestic
cars)—namely, 1, 2, 3, 4, 5 and . (missing)—and 4 distinct
values of **rep78** when **foreign** is 1 (Foreign cars)—namely,
3, 4, 5, and . (missing). This result can also be shown directly by using
**tabulate, miss**.

Alternatively, we might wish to calculate the number of distinct observations
as defined by the combinations of **foreign** and **rep78**:

. by foreign rep78, sort: gen nvals = _n ==1 . replace nvals = sum(nvals) . replace nvals = nvals[_N]

If you do this for the auto data, you will find that there are 10 distinct
values of **rep78** and **foreign** combined, that is, every possible
pair except the pair 1 and Foreign and the pair 2 and Foreign. This naturally
follows from our previous calculation and can also be shown directly by using
**tabulate, miss**.

There are two further complications not tackled in detail here that are likely
to arise in practice. First, as already exemplified, computations of the kind
above include “missing”—whether **.**, **.a**,
**.b**, ..., **.z** for numeric variables or **""** for string
variables—just like any other value. It is quite possible that you
do not want that, and if so, we need to exclude missing values from the
computation. Second, you might have further restrictions on the computation,
to be specified by **if** or **in**.

Stata users have written various programs in this area, including
**distinct** (G. Longton and N.J. Cox), the **egen** function
**nvals()** (N.J. Cox), and **unique** (M. Hills and T. Brady), which
tackle most or all of the wrinkles mentioned here. These are downloadable from
the SSC archive using **ssc**, specifically by

. ssc inst distinct . ssc inst egenmore(N.B.). ssc inst unique

**distinct** reports on distinct values, **egen, nvals()** computes
their number in a new variable, and **unique** does some of both.

For an extended version of this FAQ, see Cox and Longton (2008).

- Cox, N. J. 2002.
- Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.

- Cox, N.J. and G.M. Longton. 2008.
- Distinct observations. Stata Journal 8: 557–568.