Stata | FAQ: Match merging when there are duplicate IDs

Home / Resources & support / FAQs / Match merging when there are duplicate IDs

Why does my merge produce a dataset with too many observations?

Title		Match merging when there are duplicate IDs
Author		William Gould, StataCorp

Your problem is most likely caused by having duplicate IDs. Duplicate IDs can cause unexpected results when doing a match merge. Consider the following examples:

Example 1. There are too many observations in the merged dataset

The master dataset has 5 observations, and the using dataset has 8 observations. When you do the merge, every observation has a _merge code of 2 or 3 (every observation in the master dataset was matched), yet the merged dataset contains 9 observations.

Cause: Duplicate observations in the smaller dataset (and perhaps in the larger one, too).

. use junk2  [this is the "using" dataset]
        
. list

     +--------+
     | id   y |
     |--------|
  1. |  1   1 |
  2. |  1   2 |
  3. |  1   3 |
  4. |  2   1 |
  5. |  3   1 |
     |--------|
  6. |  3   2 |
  7. |  4   1 |
  8. |  5   1 |
     +--------+

. use junk1, clear  [this is the master dataset]
        
. sort id x

. list

     +--------+
     | id   x |
     |--------|
  1. |  1   1 |
  2. |  1   2 |
  3. |  2   1 |
  4. |  2   2 |
  5. |  3   1 |
     +--------+

. merge id using junk2
variable id does not uniquely identify observations in the master data
variable id does not uniquely identify observations in junk2.dta

. list

     +---------------------+
     | id   x   y   _merge |
     |---------------------|
  1. |  1   1   1        3 |
  2. |  1   2   2        3 |
  3. |  2   1   1        3 |
  4. |  2   2   1        3 |
  5. |  3   1   1        3 |
     |---------------------|
  6. |  1   2   3        3 |
  7. |  3   1   2        3 |
  8. |  4   .   1        2 |
  9. |  5   .   1        2 |
     +---------------------+

Examine x and y, and you will see how merge matches up duplicates.

Example 2. More matches than observations in the smallest dataset

There are only 3 observations in your master dataset, yet, when you do the merge, there are 4 observations that have a _merge code of 3 (meaning the observations are in both datasets).

Cause: There are duplicates in the using dataset.

. use junk2, clear  [this is the "using" dataset]
        
. list
        
     +--------+
     | id   y |
     |--------|
  1. |  1   1 |
  2. |  1   2 |
  3. |  1   3 |
  4. |  2   1 |
  5. |  3   1 |
     |--------|
  6. |  3   2 |
  7. |  4   1 |
  8. |  5   1 |
     +--------+
          
. use junk1, clear  [this is the master dataset]

. sort id x
        
. list

     +--------+
     | id   x |
     |--------|
  1. |  1   1 |
  2. |  2   2 |
  3. |  3   3 |
     +--------+

. merge id using junk2 
variable id does not uniquely identify observations in junk2.dta

.  list

     +---------------------+
     | id   x   y   _merge |
     |---------------------|
  1. |  1   1   1        3 |
  2. |  2   2   1        3 |
  3. |  3   3   1        3 |
  4. |  1   1   2        3 |
  5. |  1   1   3        3 |
     |---------------------|
  6. |  3   3   2        3 |
  7. |  4   .   1        2 |
  8. |  5   .   1        2 |
     +---------------------+

Finding duplicate IDs

The duplicates and isid commands help identify and deal with duplicate observations.

We use cookies

We use cookies to ensure that we give you the best experience on our website—to enhance site navigation, to analyze usage, and to assist in our marketing efforts. By continuing to use our site, you consent to the storing of cookies on your device and agree to delivery of content, including web fonts and JavaScript, from third party web services.

Cookie Settings

Last updated: 16 November 2022

StataCorp LLC (StataCorp) strives to provide our users with exceptional products and services. To do so, we must collect personal information from you. This information is necessary to conduct business with our existing and potential customers. We collect and use this information only where we may legally do so. This policy explains what personal information we collect, how we use it, and what rights you have to that information.

Advertising and performance cookies

This website uses cookies to provide you with a better user experience. A cookie is a small piece of data our website stores on a site visitor's hard drive and accesses each time you visit so we can improve your access to our site, better understand how you use our site, and serve you content that may be of interest to you. For instance, we store a cookie when you log in to our shopping cart so that we can maintain your shopping cart should you not complete checkout. These cookies do not directly store your personal information, but they do support the ability to uniquely identify your internet browser and device.

Please note: Clearing your browser cookies at any time will undo preferences saved here. The option selected here will apply only to the device you are currently using.

Why does my merge produce a dataset with too many observations?

Example 1. There are too many observations in the merged dataset

Example 2. More matches than observations in the smallest dataset

Finding duplicate IDs

We use cookies

Privacy policy

Required cookies

Advertising and performance cookies

Stata/MP4 Annual License (download)

Why does my merge produce a dataset with too many observations?

Example 1. There are too many observations in the merged dataset

Example 2. More matches than observations in the smallest dataset

Finding duplicate IDs

We use cookies

Privacy policy

Required cookies

Advertising and performance cookies