Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

Re: st: RE: Fwd: identifying duplicate entry errors

From   Nick Cox <>
To   "" <>
Subject   Re: st: RE: Fwd: identifying duplicate entry errors
Date   Tue, 25 Feb 2014 17:56:31 +0000

The aim of -duplicates- is, hmm, to identify duplicates. But it is not
the only tool to identify duplicates. Let's suppose first that you
want to identify duplicates based on -a b c-. Then

egen group = group(a b c), label

groups observations identical on -a b c-.

su group

tells you how many groups that means: -group- takes the values 1, 2,
and so on, so its maximum is the number of distinct groups. Now suppose
you have further interest in -d e f-. Consider

bysort group (d) :

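If you sort on -d- within distinct groups of -group- then any
different values are shaken apart, so the first and last values within
a group differ whenever -d- is not constant in that group. (In fact,
you don't need to create -group- to do this: -bysort a b c (d) :-
works the same way.) Let's take this further to identify what's
variable within distinct values of -group-.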

gen whatvaries = ""

foreach v in d e f {
      * after sorting on `v' within group, first != last means `v' varies
      bysort group (`v') : replace whatvaries = ///
      whatvaries + cond(`v'[_N] != `v'[1], "`v' ", "")
}

The analysis of -whatvaries- might not be easy, but it's what you seem
to be asking for.
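
As a rough sketch of how that analysis might go, here is the same loop run
on a small made-up dataset, followed by a count of how many variables differ
within each group. The toy data and the names -nvaries- and -ningroup- are
my own assumptions for illustration, not part of the original question. Run
it from a do-file, as -///- is not understood interactively.

```stata
* toy data: a b c define the candidate duplicates; d e f may differ
clear
input a b c d e f
1 1 1 5 6 7
1 1 1 5 6 8
2 2 2 1 1 1
2 2 2 2 1 3
3 3 3 9 9 9
end

egen group = group(a b c), label

gen whatvaries = ""
foreach v in d e f {
      * first != last after sorting on `v' means `v' varies in the group
      bysort group (`v') : replace whatvaries = ///
      whatvaries + cond(`v'[_N] != `v'[1], "`v' ", "")
}

* number of variables that differ within each group (hypothetical name)
gen nvaries = wordcount(whatvaries)

* group size, so that singletons can be ignored (hypothetical name)
bysort group : gen ningroup = _N

list a b c d e f whatvaries nvaries if ningroup > 1, sepby(group)
tabulate whatvaries if ningroup > 1
```

In the toy data the first pair differs only on -f- and the second pair on
-d- and -f-, so -nvaries- is 1 and 2 respectively; the single observation
with -a b c- all 3 has nothing to differ from and is excluded from the
listing.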

See also

How do I list observations in a group that differ on a variable?


On 25 February 2014 17:40, Alison El Ayadi <> wrote:
> Thanks for your suggestion.  I have run through a number of different
> combinations of listing the duplicates, but I suspect that there are
> duplicates that I can identify when limiting to certain variables but
> that I do not obtain when including all variables due to data entry
> errors.  That's why it would be so great to have something that says
> these are duplicate groups based on var x, var y, and var z, and here is
> the total number of variables that differ between the duplicate
> observations.
> Best,
> Alison
> On Tue, Feb 25, 2014 at 9:15 AM, Joe Canner <> wrote:
>> Alison,
>> Have you tried -duplicates list-?  This is probably not as helpful as you would like, but it's a start.  I have had similar wishes for the -duplicates- command.  If there are no ideas forthcoming in response to your question, perhaps it is time to write an enhancement of the -duplicates- command.
>> Regards,
>> Joe Canner
>> Johns Hopkins University School of Medicine
>> -----Original Message-----
>> From: [] On Behalf Of Alison El Ayadi
>> Sent: Tuesday, February 25, 2014 11:07 AM
>> To:
>> Subject: st: Fwd: identifying duplicate entry errors
>> Dear Statalisters,
>> I am working to identify duplicates within a very messy dataset and would love to be able to identify among those observations which have the same values for a set of variables (that I define) what are the variables where their values differ (how are they not true duplicates).
