Note: This FAQ is relevant for users of releases prior to Stata 8.
How do I identify duplicate observations in my data?
|
Title
|
|
Identifying and dropping duplicate observations
|
|
Author
|
William Gould, StataCorp
|
|
Date
|
October 1997, September 1999
|
Starting with Stata 8, the duplicates command provides a
way to report on, give examples of, list, browse, tag, or drop duplicate
observations. This FAQ is likely only of interest to users of previous
versions of Stata.
Case 1: Identifying duplicates based on a subset of variables
You wish to create a new variable named dup
dup = 0 record is unique
dup = 1 record is duplicate, first occurrence
dup = 2 record is duplicate, second occurrence
dup = 3 record is duplicate, third occurrence
etc.
and to base the determination on the variables name, age, and
sex.
. sort name age sex
. quietly by name age sex: gen dup = cond(_N==1,0,_n)
Note the capitalization of _N and _n. (Stata interprets
_N to mean the total number of observations in the by-group and
_n to be the observation number within the by-group.)
Having created the new variable dup, you could then
. tabulate dup
to see a report of the duplicate count.
To base the duplicate count solely on name, type
. sort name
. quietly by name: gen dup = cond(_N==1,0,_n)
To base the duplicate count on name, age, sex, and
address, type
. sort name age sex address
. quietly by name age sex address: gen dup = cond(_N==1,0,_n)
Case 2: Dropping duplicates based on a subset of variables
Picking up where case 1 left off, if you want to drop all duplicate
observations but keep the first occurrence, type
. drop if dup>1
To drop all duplicate observations, including the first occurrence, type
. drop if dup>0
Case 3: Identifying duplicates based on all the variables
Obviously, one solution would be to treat this as case 1, and simply type the
names of all the variables in your dataset. The following is a variation on
that idea that avoids some typing:
. unab vlist : _all
. sort `vlist'
. quietly by `vlist': gen dup = cond(_N==1,0,_n)
The use of unab is a trick; it puts the names of every variable in
the dataset in the macro vlist.
In the last two lines, note that you type `vlist', which is to say,
left single quote, the word vlist, right single quote. Putting open
and close quotes around a macro's name is equivalent to typing out the
contents of the macro. Thus sort `vlist' is equivalent to typing
out sort followed by the names of all the variables in the dataset.
Warning: You type left single
quote (`), the word vlist, and finally a right single quote
('): `vlist'. The left-single-quote character may be difficult
to see on your browser, but nothing else will do. If you type
quote-vlist-quote ('vlist') Stata will complain. On U.S.
keyboards, the left single quote is the top-left key, the same key as the
tilde (~) symbol.
Case 4: Dropping duplicates based on all the variables
Picking up where case 3 left off, this is no different from case 2. If you
want to drop all duplicate observations but keep the first occurrence, type
. drop if dup>1
If you want to drop all duplicate observations, including the first
occurrence, type
. drop if dup>0
Example 1
We have the following data:
. list
make price mpg
1. VW Diesel 5397 41
2. BMW 320i 9735 25
3. Datsun 510 5079 24
4. Audi 5000 9690 17
5. BMW 320i 9375 25
6. VW Diesel 5397 41
7. BMW 320i 9735 25
We wish to eliminate duplicate observations based on make:
. sort make
. quietly by make: gen dup = cond(_N==1,0,_n)
The final step is drop if dup>1 but, before we do that, we will
list the data so that you can see the values in new variable dup:
. list
make price mpg dup
1. Audi 5000 9690 17 0
2. BMW 320i 9735 25 1
3. BMW 320i 9735 25 2
4. BMW 320i 9375 25 3
5. Datsun 510 5079 24 0
6. VW Diesel 5397 41 1
7. VW Diesel 5397 41 2
. drop if dup>1
(3 observations deleted)
. list
make price mpg dup
1. Audi 5000 9690 17 0
2. BMW 320i 9735 25 1
3. Datsun 510 5079 24 0
4. VW Diesel 5397 41 1
Example 2
We start with the data used in example 1, but this time we drop duplicates
based on all the variables:
. list
make price mpg
1. VW Diesel 5397 41
2. BMW 320i 9735 25
3. Datsun 510 5079 24
4. Audi 5000 9690 17
5. BMW 320i 9375 25
6. VW Diesel 5397 41
7. BMW 320i 9735 25
. local 0
. unab vlist : _all
. quietly by `vlist': gen dup = cond(_N==1,0,_n)
. drop if dup>1
(2 observations deleted)
. list
make price mpg dup
1. Audi 5000 9690 17 0
2. BMW 320i 9375 25 0
3. BMW 320i 9735 25 1
4. Datsun 510 5079 24 0
5. VW Diesel 5397 41 1
We obtained a different result: This time, the BMW appears twice in our
data, once with a price of 9375 and once with price 9735. Presumably, one
of those prices contains a transcription error. When we based the duplicate
identification solely on the contents of variable make, we obtained
one of the BMW observations at random.
An alternative method for identifying duplicates is discussed in the
Technical Note in [R] egen.
|