Title | Calculating the number of distinct values

Author | Nicholas J. Cox, Durham University, UK

Date | September 2006; minor revision February 2014

I have data collected in sequence like this:

. list

         +------+
         |    x |
         |------|
      1. |  cd1 |
      2. |  cd2 |
      3. |  cd2 |
      4. |  cd3 |
      5. |  cd1 |
         |------|
      6. |  cd3 |
      7. |  cd4 |
      8. |  cd1 |
      9. |  cd5 |
     10. |  cd3 |
         +------+

I want to keep track of the number of distinct values seen so far in the
sequence. This number increases from 1 at observation 1 (**cd1** first
occurs), to 2 at observation 2 (**cd2** first occurs), to 3 at
observation 4 (**cd3** first occurs), and so forth.

You can do the above by using **by:**,
which is one of the most versatile features of Stata.

One clue that **by:** is useful here is that the values of the variable
**x** fall naturally into groups, one for each distinct value. All we need
to do is tag the first occurrence of each distinct value, and then count
those first occurrences in sequence.

**by:** goes hand in hand with sorting. We should keep a record of the
current order of observations, because we will want to return to
it. If the dataset already includes a time or other identifier indicating
sequence, we can use that. Otherwise, generate a variable recording the
current order:

. generate order = _n

If your dataset is very large, store the order as a **long** instead:

. generate long order = _n
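The reason for **long** is that **generate** creates a **float** unless told otherwise, and a float cannot hold every integer beyond 16,777,216 exactly, so two observations in a very large dataset could end up with the same value of **order**. A small check, offered only as an illustration (the variable names **a** and **b** are made up for this sketch, and **clear** discards any data in memory):

```stata
* Illustration only: clears data in memory; a and b are made-up names.
clear
set obs 1

* Store the same integer as a float and as a long
generate float a = 16777217
generate long  b = 16777217

* The float has been rounded to a neighboring value; the long is exact
display %12.0f a[1]
display %12.0f b[1]
```

The two displayed values differ, which is why a float is not a safe record of observation order once a dataset grows beyond about 16.7 million observations.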

We will sort into groups of **x** and ensure that within those groups the
original order of observations is followed. Then we tag the first
occurrence of each value of **x**. This process can all be telescoped
into one statement:

. by x (order), sort: generate y = _n == 1

That statement can be thought of as a condensed version of

. sort x order

. by x: gen y = _n == 1

The sort order is first by **x** and then by **order**. Then within
groups of **x**, the first observation is tagged as 1; all others within
the same group are tagged as 0.

Let us take this more slowly: Under **by:**, the observation number
**_n** is determined within the groups defined. Thus **_n** starts
over at 1 each time a new group is encountered. So **_n** is 1 if
an observation is the first in its group. **_n == 1** is true for
all such first observations. Any true or false condition is evaluated
numerically in Stata as 1 if true and 0 if false. For more detail on that
principle, see the FAQ "What is true and false in Stata?"
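Both ideas can be seen on a tiny made-up dataset. The sketch below is only an illustration: the four values and the variable names **seq** and **first** are assumptions for this example, and **clear** discards any data in memory.

```stata
* Illustration only: clears data in memory; seq and first are made-up names.
clear
input str3 x
"cd1"
"cd2"
"cd2"
"cd1"
end
generate order = _n

sort x order

* Under by:, _n restarts at 1 in each group of x
by x: generate seq = _n

* The logical expression _n == 1 is stored as 1 (true) or 0 (false)
by x: generate first = _n == 1

* first is 1 exactly where seq is 1
assert first == (seq == 1)

list, sepby(x)
```

In the listing, **seq** runs 1, 2 within each of the two groups, and **first** is 1 only on the first observation of each group.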

After that, we need to **sort** back to the original order. Then we need a
running sum of **y**, which the **sum()** function provides, because the
number of distinct values seen so far is equal to the number of first
occurrences seen so far.

. sort order

. replace y = sum(y)

**order** has served its purpose.

. drop order

What do we have now?

. list

         +----------+
         |   x    y |
         |----------|
      1. | cd1    1 |
      2. | cd2    2 |
      3. | cd2    2 |
      4. | cd3    3 |
      5. | cd1    3 |
         |----------|
      6. | cd3    3 |
      7. | cd4    4 |
      8. | cd1    4 |
      9. | cd5    5 |
     10. | cd3    5 |
         +----------+
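For readers who want to reproduce this result, the steps can be gathered into one script. This is a sketch rather than part of the original session: it enters the ten example values with **input**, begins with **clear** (which discards any data in memory), and adds a few **assert** checks against the listing.

```stata
* Sketch only: clears data in memory before entering the example values.
clear
input str3 x
"cd1"
"cd2"
"cd2"
"cd3"
"cd1"
"cd3"
"cd4"
"cd1"
"cd5"
"cd3"
end

* Record the current order, tag first occurrences, restore order, cumulate
generate long order = _n
by x (order), sort: generate y = _n == 1
sort order
replace y = sum(y)
drop order

* Checks against the listing: y never decreases and ends at 5
assert y[1] == 1
assert y[4] == 3
assert y >= y[_n - 1] if _n > 1
assert y[_N] == 5

list
```

The final value of **y** is the number of distinct values of **x** in the whole dataset, here 5.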

With a little more knowledge, we could wrap that into a command, or
an **egen** function,
but, in many ways, it is better to use the code here and understand its logic,
which will help with the next problem of a similar flavor.
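As a rough idea of what such a wrapper might look like, here is a minimal sketch. The command name **distinctsofar** and its **generate()** option are hypothetical, invented for this illustration; this is not an official or published command, and it handles neither **if** nor **in**.

```stata
* Hypothetical wrapper: the name distinctsofar is made up for illustration.
capture program drop distinctsofar
program define distinctsofar, sortpreserve
    version 9
    syntax varname, Generate(string)

    * The new variable name must be valid and not already in use
    confirm new variable `generate'

    * Temporary record of the current order of observations
    tempvar order
    generate long `order' = _n

    * Tag the first occurrence of each distinct value
    quietly bysort `varlist' (`order'): generate long `generate' = _n == 1

    * Return to the original order and take the running sum
    sort `order'
    quietly replace `generate' = sum(`generate')
end
```

Once the program is defined, typing **distinctsofar x, generate(y)** should create the same **y** as the step-by-step code. As in that code, missing values of the variable are counted as one more distinct value.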

The key construct here is **by:**. The documentation for **by:** is
scattered around the manuals. A tutorial bringing together the main ideas is
given in Cox (2002), which explains the use of the construct to tackle a
variety of problems with group structure, ranging from simple calculations
for each of several groups to more advanced manipulations that use the
built-in **_n** and **_N**.

- Cox, N. J. 2002. Speaking Stata: How to move step by: step. *Stata Journal* 2: 86–102.

- Cox, N. J. and G. M. Longton. 2008. Distinct observations. *Stata Journal* 8: 557–568.