The following material is based on postings on
Statalist.
How do I calculate the number of distinct values seen so far?
Title: Calculating the number of distinct values
Author: Nicholas J. Cox, Durham University, UK
Date: September 2006
The problem
I have data collected in sequence like this:
. list

     +------+
     |    x |
     |------|
  1. |  cd1 |
  2. |  cd2 |
  3. |  cd2 |
  4. |  cd3 |
  5. |  cd1 |
     |------|
  6. |  cd3 |
  7. |  cd4 |
  8. |  cd1 |
  9. |  cd5 |
 10. |  cd3 |
     +------+
I want to keep track of the number of distinct values seen so far in the
sequence. This number increases from 1 at observation 1 (cd1 first
occurs), to 2 at observation 2 (cd2 first occurs), to 3 at
observation 4 (cd3 first occurs), and so forth.
The solution
You can do this with by:,
which is one of the most versatile features of Stata.
One clue that by: will be useful here is the group structure of the data:
the values of the variable x fall into several distinct groups. All we need
to do is tag the first occurrence of each distinct value, and then count
those first occurrences in sequence.
by: goes hand in hand with sorting. We should keep a record of the
current order of observations, because we will want to return to
it. If the dataset already includes a time variable, or some other identifier
indicating sequence, we can use that. Otherwise, generate a variable recording
the current order:
. generate order = _n
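If your dataset is really big (more than 16,777,216 observations, the point
beyond which the default float storage type can no longer hold every integer
exactly), that should be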
. generate long order = _n
We will sort into groups of x and ensure that within those groups the
original order of observations is followed. Then we tag the first
occurrence of each value of x. This process can all be telescoped
into one statement:
. by x (order), sort: generate y = _n == 1
That statement can be thought of as a condensed version of
. sort x order
. by x: gen y = _n == 1
The sort order is first by x and then by order. Then, within
groups of x, the first observation is tagged as 1; all others within
the same group are tagged as 0.
Let us take this more slowly: Under by:, the observation number
_n is determined within the groups defined. Thus _n starts
over at 1 each time a new group is encountered. So _n is 1 if
an observation is the first in its group. _n == 1 is true for
all such first observations. Any true or false condition is evaluated
numerically in Stata as 1 if true and 0 if false. For more detail on that
principle, see the FAQ "What is true and false in Stata?"
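As a quick check of that principle, the following lines can be typed at any time, since they do not touch the data in memory. They are only an illustrative sketch; the particular numbers compared are arbitrary choices.

```stata
* A true condition evaluates to 1
display 2 > 1

* A false condition evaluates to 0
display 2 == 3

* Because the results are numeric, they can be added: two true conditions give 2
display (3 > 2) + (5 > 4)
```

This is why _n == 1 can be stored directly in a numeric variable and later summed.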
After that, we need to sort back
to the original order. Then we need a running sum of y because the
number of distinct values seen so far is equal to the number of first
occurrences seen so far.
. sort order
. replace y = sum(y)
order has served its purpose.
. drop order
What do we have now?
. list

     +----------+
     |   x    y |
     |----------|
  1. | cd1    1 |
  2. | cd2    2 |
  3. | cd2    2 |
  4. | cd3    3 |
  5. | cd1    3 |
     |----------|
  6. | cd3    3 |
  7. | cd4    4 |
  8. | cd1    4 |
  9. | cd5    5 |
 10. | cd3    5 |
     +----------+
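For convenience, the steps can be gathered into one do-file sketch that recreates the example data and repeats the calculation. The use of clear and input to rebuild the data is an addition made here for illustration, so run this in a fresh session or do-file rather than with valuable data in memory.

```stata
* Recreate the ten observations from the example (this replaces data in memory)
clear
input str3 x
cd1
cd2
cd2
cd3
cd1
cd3
cd4
cd1
cd5
cd3
end

* Record the current order so that we can return to it
generate long order = _n

* Tag the first occurrence of each distinct value of x
by x (order), sort: generate y = _n == 1

* Restore the original order and take the running sum of the tags
sort order
replace y = sum(y)

* order has served its purpose
drop order
list
```

The final list should reproduce the listing above, with y rising by 1 each time a new value of x first appears.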
With a little more knowledge, we could wrap that into a command or
an egen function.
In many ways, however, it is better to use the code here and understand its
logic, which will help with the next problem of a similar flavor.
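For readers who nevertheless want a wrapper, one possible shape is sketched here. The command name distinctsofar and its generate() option are hypothetical names invented for this illustration; they are not an existing Stata command or a user-written package. The sketch applies exactly the same logic as the steps in this FAQ, using a temporary variable in place of order.

```stata
* Hypothetical wrapper: distinctsofar varname, generate(newvar)
program define distinctsofar
    syntax varname, Generate(name)
    confirm new variable `generate'

    * Temporary variable holding the current order of observations
    tempvar order
    quietly {
        generate long `order' = _n
        * Tag the first occurrence of each distinct value
        bysort `varlist' (`order'): generate long `generate' = _n == 1
        * Return to the original order and count first occurrences so far
        sort `order'
        replace `generate' = sum(`generate')
    }
end
```

With the example data in memory, it could be called as distinctsofar x, generate(y). Like the by: code it wraps, this sketch treats missing values of the variable as one more distinct value.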
The key construct here is by:. The documentation for by: is
scattered around the manuals. A tutorial bringing together the main ideas is
given in Cox (2002), which explains the use of the construct to tackle a
variety of problems with group structure, ranging from simple calculations
for each of several groups to more advanced manipulations that use the
built-in _n and _N.
Reference
Cox, N. J. 2002. Speaking Stata: How to move step by: step.
Stata Journal 2: 86–102.