Search
   >> Home >> Resources & support >> FAQs >> Identifying and dropping duplicate observations
Note: This FAQ is relevant for users of releases prior to Stata 8.

How do I identify duplicate observations in my data?

Title   Identifying and dropping duplicate observations
Author William Gould, StataCorp
Date October 1997, September 1999

Starting with Stata 8, the duplicates command provides a way to report on, give examples of, list, browse, tag, or drop duplicate observations. This FAQ is likely only of interest to users of previous versions of Stata.

Case 1: Identifying duplicates based on a subset of variables

You wish to create a new variable named dup

         dup = 0       record is unique
         dup = 1       record is duplicate, first occurrence
         dup = 2       record is duplicate, second occurrence
         dup = 3       record is duplicate, third occurrence
         etc.

and to base the determination on the variables name, age, and sex.

        . sort name age sex
        . quietly by name age sex:  gen dup = cond(_N==1,0,_n)

Note the capitalization of _N and _n. (Stata interprets _N to mean the total number of observations in the by-group and _n to be the observation number within the by-group.)

Having created the new variable dup, you could then

        . tabulate dup

to see a report of the duplicate count.

To base the duplicate count solely on name, type

        . sort name
        . quietly by name:  gen dup = cond(_N==1,0,_n)

To base the duplicate count on name, age, sex, and address, type

        . sort name age sex address
        . quietly by name age sex address: gen dup = cond(_N==1,0,_n)

Case 2: Dropping duplicates based on a subset of variables

Picking up where case 1 left off, if you want to drop all duplicate observations but keep the first occurrence, type

        . drop if dup>1

To drop all duplicate observations, including the first occurrence, type

        . drop if dup>0

Case 3: Identifying duplicates based on all the variables

Obviously, one solution would be to treat this as case 1, and simply type the names of all the variables in your dataset. The following is a variation on that idea that avoids some typing:

        . unab vlist : _all
        . sort `vlist'
        . quietly by `vlist':  gen dup = cond(_N==1,0,_n)

The use of unab is a trick; it puts the names of every variable in the dataset in the macro vlist.

In the last two lines, note that you type `vlist', which is to say, left single quote, the word vlist, right single quote. Putting open and close quotes around a macro's name is equivalent to typing out the contents of the macro. Thus sort `vlist' is equivalent to typing out sort followed by the names of all the variables in the dataset.

Warning:  You type left single quote (`), the word vlist, and finally a right single quote ('): `vlist'. The left-single-quote character may be difficult to see on your browser, but nothing else will do. If you type quote-vlist-quote ('vlist') Stata will complain. On U.S. keyboards, the left single quote is the top-left key, the same key as the tilde (~) symbol.

Case 4: Dropping duplicates based on all the variables

Picking up where case 3 left off, this is no different from case 2. If you want to drop all duplicate observations but keep the first occurrence, type

        . drop if dup>1

If you want to drop all duplicate observations, including the first occurrence, type

        . drop if dup>0

Example 1

We have the following data:

        . list 
                   make     price       mpg  
          1.  VW Diesel      5397        41  
          2.   BMW 320i      9735        25  
          3. Datsun 510      5079        24  
          4.  Audi 5000      9690        17  
          5.   BMW 320i      9375        25  
          6.  VW Diesel      5397        41  
          7.   BMW 320i      9735        25  

We wish to eliminate duplicate observations based on make:

        . sort make 

        . quietly by make: gen dup = cond(_N==1,0,_n)

The final step is drop if dup>1 but, before we do that, we will list the data so that you can see the values in new variable dup:

        . list 
                   make     price       mpg        dup  
          1.  Audi 5000      9690        17          0  
          2.   BMW 320i      9735        25          1  
          3.   BMW 320i      9735        25          2  
          4.   BMW 320i      9375        25          3  
          5. Datsun 510      5079        24          0  
          6.  VW Diesel      5397        41          1  
          7.  VW Diesel      5397        41          2  
        
        . drop if dup>1
        (3 observations deleted)

        . list
                   make     price       mpg        dup  
          1.  Audi 5000      9690        17          0  
          2.   BMW 320i      9735        25          1  
          3. Datsun 510      5079        24          0  
          4.  VW Diesel      5397        41          1  

Example 2

We start with the data used in example 1, but this time we drop duplicates based on all the variables:

        . list
                   make     price       mpg  
          1.  VW Diesel      5397        41  
          2.   BMW 320i      9735        25  
          3. Datsun 510      5079        24  
          4.  Audi 5000      9690        17  
          5.   BMW 320i      9375        25  
          6.  VW Diesel      5397        41  
          7.   BMW 320i      9735        25  
        
        . local 0

        . unab vlist : _all

        . quietly by `vlist': gen dup = cond(_N==1,0,_n)

        . drop if dup>1 
        (2 observations deleted)

        . list

                   make     price       mpg        dup  
          1.  Audi 5000      9690        17          0  
          2.   BMW 320i      9375        25          0  
          3.   BMW 320i      9735        25          1  
          4. Datsun 510      5079        24          0  
          5.  VW Diesel      5397        41          1  

We obtained a different result: This time, the BMW appears twice in our data, once with a price of 9375 and once with price 9735. Presumably, one of those prices contains a transcription error. When we based the duplicate identification solely on the contents of variable make, we obtained one of the BMW observations at random.

An alternative method for identifying duplicates is discussed in the Technical Note in [D] egen.

The Stata Blog: Not Elsewhere Classified Find us on Facebook Follow us on Twitter LinkedIn Google+ Watch us on YouTube