Stata | FAQ: Calculating the number of distinct values

The following material is based on postings on Statalist.

How do I calculate the number of distinct values seen so far?

Title		Calculating the number of distinct values
Author		Nicholas J. Cox, Durham University, UK

The problem

I have data collected in sequence like this:

      . list
      
            +------+
            |    x |
            |------|
         1. |  cd1 |
         2. |  cd2 |
         3. |  cd2 |
         4. |  cd3 |
         5. |  cd1 |
            |------|
         6. |  cd3 |
         7. |  cd4 |
         8. |  cd1 |
         9. |  cd5 |
        10. |  cd3 |
            +------+

I want to keep track of the number of distinct values seen so far in the sequence. This number increases from 1 at observation 1 (cd1 first occurs), to 2 at observation 2 (cd2 first occurs), to 3 at observation 4 (cd3 first occurs), and so forth.
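To follow along, the example data can be typed in directly. This is a minimal sketch that assumes x is a string variable of length 3, as the listing suggests; the original data may have been stored differently.

```stata
clear
input str3 x
"cd1"
"cd2"
"cd2"
"cd3"
"cd1"
"cd3"
"cd4"
"cd1"
"cd5"
"cd3"
end
list
```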

The solution

You can do the above by using by:, which is one of the most versatile features of Stata.

One clue to by: being useful here is the structure of a grouping of the variable x into several distinct values. All we need to do is tag the first occurrence of each distinct value, and then count those first occurrences in sequence.

by: goes hand in hand with sorting. We should keep a record of the current order of observations, because we will want to return to it. If the dataset already includes a time or other identifier indicating sequence, we can use that. Otherwise, generate a variable recording the current order:

      . generate order = _n

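If your dataset is really big (more than about 16.7 million observations), the default float storage type cannot hold every observation number exactly, so that should be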

      . generate long order = _n
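A quick check of why long matters, assuming the default storage type is float (as it is unless changed with set type). A float holds every integer exactly only up to 16,777,216, so larger observation numbers can collide:

```stata
* 16,777,216 is stored exactly as a float
display %12.0f float(16777216)
* 16,777,217 is not: it is rounded to 16,777,216
display %12.0f float(16777217)
* So two different observation numbers would look identical in a float variable
display float(16777217) == float(16777216)
```

The last line displays 1 (true), which is exactly the tie that would make the sort on order ambiguous.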

We will sort into groups of x and ensure that within those groups the original order of observations is followed. Then we tag the first occurrence of each value of x. This process can all be telescoped into one statement:

      . by x (order), sort: generate y = _n == 1

That statement can be thought of as a condensed version of

      . sort x order 
      . by x: gen y = _n == 1

The sort order is first by x and then by order. Then within groups of x, the first observation is tagged as 1; all others within the same group are tagged as 0.

Let us take this more slowly: Under by:, the observation number _n is determined within the groups defined. Thus _n starts over at 1 each time a new group is encountered. So _n is 1 if an observation is the first in its group. _n == 1 is true for all such first observations. Any true or false condition is evaluated numerically in Stata as 1 if true and 0 if false. For more detail on that principle, see the FAQ "What is true and false in Stata?"
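Both ideas can be seen directly. The sketch below rebuilds the ten example values compactly with word(), and the variable names within and first are introduced here only for illustration:

```stata
clear
set obs 10
* Same ten values as in the listing above, built compactly for this sketch
generate str3 x = "cd" + word("1 2 2 3 1 3 4 1 5 3", _n)
generate order = _n

* A true-or-false expression evaluates to 1 or 0
* (the first line displays 1, the second displays 0)
display 2 == 2
display 2 == 3

* Under by:, _n restarts at 1 within each group of x
by x (order), sort: generate within = _n
by x (order): generate first = _n == 1
list x order within first, sepby(x)
```

In the listing, first is 1 exactly where within is 1, and order shows where each of those first occurrences sat in the original sequence.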

After that, we need to sort back to the original order. Then we need a running sum of y because the number of distinct values seen so far is equal to the number of first occurrences seen so far.

      . sort order
      . replace y = sum(y)

order has served its purpose.

      . drop order

What do we have now?

      . list

            +----------+
            |    x   y |
            |----------|
         1. |  cd1   1 |
         2. |  cd2   2 |
         3. |  cd2   2 |
         4. |  cd3   3 |
         5. |  cd1   3 |
            |----------|
         6. |  cd3   3 |
         7. |  cd4   4 |
         8. |  cd1   4 |
         9. |  cd5   5 |
        10. |  cd3   5 |
            +----------+
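For convenience, the steps can be collected into one self-contained do-file. This is a sketch: the data are rebuilt compactly with word() rather than read from a file, and the assert line is added here only to confirm that y matches the listing.

```stata
clear
set obs 10
* Same ten values as in the listing above, built compactly for this sketch
generate str3 x = "cd" + word("1 2 2 3 1 3 4 1 5 3", _n)

* Remember the original sequence
generate long order = _n
* Tag the first occurrence of each distinct value
by x (order), sort: generate y = _n == 1
* Return to the original sequence and count first occurrences so far
sort order
replace y = sum(y)
drop order

* Check against the listing above
assert y == real(word("1 2 2 3 3 3 4 4 5 5", _n))
list
```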

With a little more knowledge, we could wrap that into a command, or an egen function, but, in many ways, it is better to use the code here and understand its logic, which will help for that next problem with a similar flavor.
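As an illustration of what such a wrapper might look like, here is a minimal sketch. The command name distinctsofar and its generate() option are hypothetical, invented for this example; this is not an official or published command.

```stata
capture program drop distinctsofar
* Hypothetical wrapper: distinctsofar varname, generate(newvar)
program define distinctsofar, sortpreserve
    version 12
    syntax varname, Generate(name)
    confirm new variable `generate'
    tempvar order
    quietly {
        * Remember the sequence in which observations arrived
        generate long `order' = _n
        * Tag first occurrences within groups of the named variable
        bysort `varlist' (`order'): generate long `generate' = _n == 1
        * Back to the original sequence, then take the running sum
        sort `order'
        replace `generate' = sum(`generate')
    }
end

* Usage on the example data
clear
set obs 10
generate str3 x = "cd" + word("1 2 2 3 1 3 4 1 5 3", _n)
distinctsofar x, generate(y)
list
```

Like the code above, this sketch treats missing values of the variable as just another distinct value, and it makes no attempt to handle if or in qualifiers.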

The key construct here is by:. The documentation for by: is scattered around the manuals. A tutorial bringing together the main ideas is given in Cox (2002), which explains the use of the construct to tackle a variety of problems with group structure, ranging from simple calculations for each of several groups to more advanced manipulations that use the built-in _n and _N.

References

Cox, N. J. 2002. Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.

Cox, N. J., and G. M. Longton. 2008. Distinct observations. Stata Journal 8: 557–568.

