Why does my merge produce a dataset with too many observations?
|
Title
|
|
Match merging when there are duplicate IDs
|
|
Author
|
William Gould, StataCorp
|
|
Date
|
January 1996; minor update August 2005
|
Your problem is most likely caused by having duplicate IDs. Duplicate IDs
can cause unexpected results when doing a match merge. Consider the
following examples:
Example 1. There are too many observations in the merged dataset
The master dataset has 5 observations, and the using dataset has 8
observations. When you do the merge, every observation has a _merge
code of 2 or 3 (every observation in the master dataset was matched), yet
the merged dataset contains 9 observations.
Cause: Duplicate observations in the smaller dataset (and perhaps in the
larger one, too).
. use junk2 [this is the "using" dataset]
. list
+--------+
| id y |
|--------|
1. | 1 1 |
2. | 1 2 |
3. | 1 3 |
4. | 2 1 |
5. | 3 1 |
|--------|
6. | 3 2 |
7. | 4 1 |
8. | 5 1 |
+--------+
. use junk1, clear [this is the master dataset]
. sort id x
. list
+--------+
| id x |
|--------|
1. | 1 1 |
2. | 1 2 |
3. | 2 1 |
4. | 2 2 |
5. | 3 1 |
+--------+
. merge id using junk2
variable id does not uniquely identify observations in the master data
variable id does not uniquely identify observations in junk2.dta
. list
+---------------------+
| id x y _merge |
|---------------------|
1. | 1 1 1 3 |
2. | 1 2 2 3 |
3. | 2 1 1 3 |
4. | 2 2 1 3 |
5. | 3 1 1 3 |
|---------------------|
6. | 1 2 3 3 |
7. | 3 1 2 3 |
8. | 4 . 1 2 |
9. | 5 . 1 2 |
+---------------------+
Examine x and y, and you will see how merge matches up
duplicates.
Example 2. More matches than observations in the smallest dataset
There are only 3 observations in your master dataset, yet, when you do the
merge, there are 4 observations that have a _merge code of 3 (meaning
the observations are in both datasets).
Cause: There are duplicates in the using dataset.
. use junk2, clear [this is the "using" dataset]
. list
+--------+
| id y |
|--------|
1. | 1 1 |
2. | 1 2 |
3. | 1 3 |
4. | 2 1 |
5. | 3 1 |
|--------|
6. | 3 2 |
7. | 4 1 |
8. | 5 1 |
+--------+
. use junk1, clear [this is the master dataset]
. sort id x
. list
+--------+
| id x |
|--------|
1. | 1 1 |
2. | 2 2 |
3. | 3 3 |
+--------+
. merge id using junk2
variable id does not uniquely identify observations in junk2.dta
. list
+---------------------+
| id x y _merge |
|---------------------|
1. | 1 1 1 3 |
2. | 2 2 1 3 |
3. | 3 3 1 3 |
4. | 1 1 2 3 |
5. | 1 1 3 3 |
|---------------------|
6. | 3 3 2 3 |
7. | 4 . 1 2 |
8. | 5 . 1 2 |
+---------------------+
Finding duplicate IDs
The
duplicates and
isid commands
help identify and deal with duplicate observations.
|