Search
   >> Home >> Resources & support >> FAQs >> Match merging when there are duplicate IDs

Why does my merge produce a dataset with too many observations?

Title   Match merging when there are duplicate IDs
Author William Gould, StataCorp
Date January 1996; minor update August 2005

Your problem is most likely caused by having duplicate IDs. Duplicate IDs can cause unexpected results when doing a match merge. Consider the following examples:

Example 1. There are too many observations in the merged dataset

The master dataset has 5 observations, and the using dataset has 8 observations. When you do the merge, every observation has a _merge code of 2 or 3 (every observation in the master dataset was matched), yet the merged dataset contains 9 observations.

Cause: Duplicate observations in the smaller dataset (and perhaps in the larger one, too).

. use junk2  [this is the "using" dataset]
        
. list

     +--------+
     | id   y |
     |--------|
  1. |  1   1 |
  2. |  1   2 |
  3. |  1   3 |
  4. |  2   1 |
  5. |  3   1 |
     |--------|
  6. |  3   2 |
  7. |  4   1 |
  8. |  5   1 |
     +--------+

. use junk1, clear  [this is the master dataset]
        
. sort id x

. list

     +--------+
     | id   x |
     |--------|
  1. |  1   1 |
  2. |  1   2 |
  3. |  2   1 |
  4. |  2   2 |
  5. |  3   1 |
     +--------+

. merge id using junk2
variable id does not uniquely identify observations in the master data
variable id does not uniquely identify observations in junk2.dta

. list

     +---------------------+
     | id   x   y   _merge |
     |---------------------|
  1. |  1   1   1        3 |
  2. |  1   2   2        3 |
  3. |  2   1   1        3 |
  4. |  2   2   1        3 |
  5. |  3   1   1        3 |
     |---------------------|
  6. |  1   2   3        3 |
  7. |  3   1   2        3 |
  8. |  4   .   1        2 |
  9. |  5   .   1        2 |
     +---------------------+

Examine x and y, and you will see how merge matches up duplicates.

Example 2. More matches than observations in the smallest dataset

There are only 3 observations in your master dataset, yet, when you do the merge, there are 4 observations that have a _merge code of 3 (meaning the observations are in both datasets).

Cause: There are duplicates in the using dataset.

. use junk2, clear  [this is the "using" dataset]
        
. list
        
     +--------+
     | id   y |
     |--------|
  1. |  1   1 |
  2. |  1   2 |
  3. |  1   3 |
  4. |  2   1 |
  5. |  3   1 |
     |--------|
  6. |  3   2 |
  7. |  4   1 |
  8. |  5   1 |
     +--------+
          
. use junk1, clear  [this is the master dataset]

. sort id x
        
. list

     +--------+
     | id   x |
     |--------|
  1. |  1   1 |
  2. |  2   2 |
  3. |  3   3 |
     +--------+

. merge id using junk2 
variable id does not uniquely identify observations in junk2.dta

.  list

     +---------------------+
     | id   x   y   _merge |
     |---------------------|
  1. |  1   1   1        3 |
  2. |  2   2   1        3 |
  3. |  3   3   1        3 |
  4. |  1   1   2        3 |
  5. |  1   1   3        3 |
     |---------------------|
  6. |  3   3   2        3 |
  7. |  4   .   1        2 |
  8. |  5   .   1        2 |
     +---------------------+

Finding duplicate IDs

The duplicates and isid commands help identify and deal with duplicate observations.
The Stata Blog: Not Elsewhere Classified Find us on Facebook Follow us on Twitter LinkedIn Google+ Watch us on YouTube