[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Detecting Duplicate Records

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	Re: st: Detecting Duplicate Records
Date	Thu, 15 Aug 2002 12:10:40 +0100

Amani Siyam wrote

I have a household roster data file which consists of about 20
variables measured on household members.  I have my doubts that the
persons_id within a household is not unique.  Is there a way I can
"mass-check" all 20 variables between members of the same households
to determine duplicate records.  I thought of the following:

sort hhid persons_id

for var V1-V20: gen DX=X[_n]==X[_n-1]

quietly by hhid: egen DSUM=rsum( DV1. .... DV20)

quietly by hhid: drop if DSUM[n]==20

Does that make sense!

>>> As others have pointed out, there
are canned solutions in this territory. -findit
duplicates- points to several as well as to an
FAQ.

The basic idea behind Amani's code is

1. get observations in the right sort order

2. if this observation's the same as the previous
one, it is a duplicate

However, getting all the details right
is tricky.

a. Note that sorting by -hhid persons_id-
does not guarantee the needed sort order
w.r.t. other variables. Here's a counter-example:

hh_id persons_id v1 v2 v3
1       1         1  1  1
1       1         2  2  2
1       1         1  1  1

The first and last observation are
duplicates, but would not be recognised
as such. Thus you need to -sort- on
all variables of interest.

b. Looking at the previous observation
on the v* alone could declare false duplicates:

hh_id persons_id v1 v2 v3
1      1          1  1  1
1      2          1  1  1

c. In Amani's last two statements, the

by hhid:

does nothing as the row sum and
-drop- are just observation by
observation: the operation is
the same with and without -by:-.
It does no harm, however.

However, it's in the previous statement that a

by hhid persons_id:

_is_ needed, to fix the bug in b.

As it happens, Stata provides all the machinery
you need to do this more concisely and directly.

bysort varlist: list varlist if _N > 2

gives a full report of duplicates, at the price of
much unnecessary output.

by varlist : egen dups = sum(_N > 1)

is another useful one-liner, as
explained at [R] egen p.417.

Nick
[email protected]


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Prev by Date: st: Re: spaces between variables
Next by Date: st: Re: dummies from continuous variable
Previous by thread: RE: st: Detecting Duplicate Records
Next by thread: RE: st: Detecting Duplicate Records
Index(es):
- Date
- Thread