Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: RE: Fwd: identifying duplicate entry errors
Date: Tue, 25 Feb 2014 17:56:31 +0000
The aim of -duplicates- is, hmm, to identify duplicates. But it is not the only tool to identify duplicates.

Let's suppose first that you want to identify duplicates based on -a b c-. Then

egen group = group(a b c), label

groups observations identical on -a b c-.

su group

tells you how many groups that means.

Now suppose you have further interest in -d e f-. Consider

bysort group (d) :

If you sort on -d- within distinct groups of -group- then any different values are shaken apart. (In fact, you don't need to create -group- to do this.)

Let's take this further to identify what's variable within distinct values of -group-.

gen whatvaries = ""
foreach v in d e f {
    bysort group (`v') : replace whatvaries = ///
        whatvaries + cond(`v'[_N] != `v'[1], "`v' ", "")
}

The analysis of -whatvaries- might not be easy, but it's what you seem to be asking for.

See also

How do I list observations in a group that differ on a variable?
http://www.stata.com/support/faqs/data-management/listing-observations-in-group/

Nick
njcoxstata@gmail.com

On 25 February 2014 17:40, Alison El Ayadi <aelayadi@gmail.com> wrote:
> Thanks for your suggestion. I have run through a number of different
> combinations of listing the duplicates, but I suspect that there are
> duplicates that I can identify when limiting to certain variables but
> that I do not obtain when including all variables due to data entry
> errors. That's why it would be so great to have something that says
> these are duplicate groups based on var x, var, and var z, and here is
> the total number of variables that differ between the duplicate
> observation.
>
> Best,
> Alison
>
> On Tue, Feb 25, 2014 at 9:15 AM, Joe Canner <jcanner1@jhmi.edu> wrote:
>> Alison,
>>
>> Have you tried -duplicates list-? This is probably not as helpful as you would like, but it's a start. I have had similar wishes for the -duplicate- command. If there are no ideas forthcoming in response to your question, perhaps it is time to write an enhancement of the -duplicates- command.
>>
>> Regards,
>> Joe Canner
>> Johns Hopkins University School of Medicine
>>
>> -----Original Message-----
>> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Alison El Ayadi
>> Sent: Tuesday, February 25, 2014 11:07 AM
>> To: statalist@hsphsun2.harvard.edu
>> Subject: st: Fwd: identifying duplicate entry errors
>>
>> Dear Statalisters,
>>
>> I am working to identify duplicates within a very messy dataset and would love to be able to identify among those observations which have the same values for a set of variables (that I define) what are the variables where their values differ (how are they not true duplicates).

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
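
A worked sketch of the -whatvaries- approach from the reply above, on made-up data, may help. Everything here is an assumption for illustration only: the toy values, the use of -a b c- as the defining variables and -d e f- as the variables of further interest, and the extra variables -nvaries- and -tag-, which are not in the original post. Counting the words in -whatvaries- is one possible way to get the "total number of variables that differ" that was asked for.

```stata
* Toy data (invented): a b c define the groups; d e f may differ within them
clear
input a b c d e f
1 1 1 10 20 30
1 1 1 10 21 30
2 2 2  5  6  7
2 2 2  5  6  7
3 3 3  1  2  3
3 3 3  9  2  4
end

* Observations identical on a b c share a value of group
egen group = group(a b c)

* For each variable, sort within group and compare first and last values;
* if they differ, append the variable name to whatvaries
gen whatvaries = ""
foreach v in d e f {
    bysort group (`v') : replace whatvaries = ///
        whatvaries + cond(`v'[_N] != `v'[1], "`v' ", "")
}

* Hypothetical addition: number of variables that differ within each group
gen nvaries = wordcount(whatvaries)

* Hypothetical addition: show one line per group
egen tag = tag(group)
list group a b c whatvaries nvaries if tag, noobs
```

With these toy values, the first group should be flagged on -e- only, the second group on nothing (a true duplicate), and the third on -d- and -f-. Note that a missing value sorts last in Stata, so a group with one missing and one nonmissing value on a variable would be flagged as differing on it.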