[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
"Nick Cox" <n.j.cox@durham.ac.uk> |

To |
<statalist@hsphsun2.harvard.edu> |

Subject |
Re: st: Detecting Duplicate Records |

Date |
Thu, 15 Aug 2002 12:10:40 +0100 |

Amani Siyam wrote I have a household roster data file which consists of about 20 variables measured on household members. I have my doubts that the persons_id within a household is not unique. Is there a way I can "mass-check" all 20 variables between members of the same households to determine duplicate records. I thought of the following: sort hhid persons_id for var V1-V20: gen DX=X[_n]==X[_n-1] quietly by hhid: egen DSUM=rsum( DV1. .... DV20) quietly by hhid: drop if DSUM[n]==20 Does that make sense! >>> As others have pointed out, there are canned solutions in this territory. -findit duplicates- points to several as well as to an FAQ. The basic idea behind Amani's code is 1. get observations in the right sort order 2. if this observation's the same as the previous one, it is a duplicate However, getting all the details right is tricky. a. Note that sorting by -hhid persons_id- does not guarantee the needed sort order w.r.t. other variables. Here's a counter-example: hh_id persons_id v1 v2 v3 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 The first and last observation are duplicates, but would not be recognised as such. Thus you need to -sort- on all variables of interest. b. Looking at the previous observation on the v* alone could declare false duplicates: hh_id persons_id v1 v2 v3 1 1 1 1 1 1 2 1 1 1 c. In Amani's last two statements, the by hhid: does nothing as the row sum and -drop- are just observation by observation: the operation is the same with and without -by:-. It does no harm, however. However, it's in the previous statement that a by hhid persons_id: _is_ needed, to fix the bug in b. As it happens, Stata provides all the machinery you need to do this more concisely and directly. bysort varlist: list varlist if _N > 2 gives a full report of duplicates, at the price of much unnecessary output. by varlist : egen dups = sum(_N > 1) is another useful one-liner, as explained at [R] egen p.417. Nick n.j.cox@durham.ac.uk * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: Re: spaces between variables** - Next by Date:
**st: Re: dummies from continuous variable** - Previous by thread:
**RE: st: Detecting Duplicate Records** - Next by thread:
**RE: st: Detecting Duplicate Records** - Index(es):

© Copyright 1996–2016 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |