
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: RE: Fwd: identifying duplicate entry errors


From   Nick Cox <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: RE: Fwd: identifying duplicate entry errors
Date   Tue, 25 Feb 2014 19:07:26 +0000

Disclosure notice: I am credited as the original author of
-duplicates-. It is now an official command, but there is a lingering
parental bond.

The challenge for -duplicates- is that (some) people want very simple
output for arbitrarily complicated situations in arbitrarily large
datasets. It is easy to see that -duplicates- can be guaranteed to
work simply only when the data are so simple that you hardly need it.

-duplicates list- lists groups when there are duplicates on the
variables specified. It is difficult to see how that would work with
other variables that are not constant! At some point you are just
reinventing -list- with the -sort- order you want.
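
As a rough sketch of that last point (the toy data and the variable names id, age, income and dup are assumptions made up for illustration), -duplicates tag- can mark the groups, after which -list- with -sepby()- shows every variable you care about:

```stata
* Sketch only: toy data; id is the key, age and income are the other variables
clear
input id age income
1 33 1
1 33 2
1 32 3
2 45 70
2 46 71
2 47 77
3 50 10
end

* dup = number of surplus observations sharing the same id (0 means unique)
duplicates tag id, generate(dup)

* List only the duplicated groups, with all variables, one block per id
sort id age income
list id age income if dup > 0, sepby(id)
```

The -if dup > 0- condition keeps unique observations out of the listing.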

My earlier answer gave one strategy, however.
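
A separate sketch, which is not a reconstruction of that earlier answer: for each duplicated id, flag which of the other variables are not constant within the group. Again the data and the names diff_age, diff_income and pickone are assumptions for illustration only.

```stata
* Sketch only: toy data; id is the key, age and income are the other variables
clear
input id age income
1 33 1
1 33 2
1 32 3
2 45 70
2 45 71
3 50 10
end

* Keep only ids that occur more than once
duplicates tag id, generate(dup)
keep if dup > 0

* After sorting on a variable within id, its first and last values
* differ if and only if the variable is not constant in that group
foreach v of varlist age income {
    bysort id (`v'): generate byte diff_`v' = `v'[1] != `v'[_N]
}

* One line per id showing which variables differ within the group
egen byte pickone = tag(id)
list id diff_age diff_income if pickone, noobs
```

Here both variables differ within id 1, but only income differs within id 2, so any summary has to be reported group by group.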

Nick
[email protected]


On 25 February 2014 18:56, Joe Canner <[email protected]> wrote:
> Sergiy,
>
> Thanks for the caveats.  I can't say that I've thought through this too much, nor do I have any immediate plans in this direction.  Although I work a lot with big data sets, the circumstances where I am most often looking for duplicates tend to be smaller, locally-collected data sets where one can actually investigate and fix duplicates.  But I take your point that summarizing how two records are different can be problematic.  Perhaps one would need to specify the level of dissimilarity of interest, e.g., if I am looking at 10 variables, list the records that differ on at most 2 variables, and list those variables.  If the latter number is small, the output shouldn't be too bad.
>
> My original interest in modifying the -duplicates- command stems from a desire to have a more informative -duplicates list- function, i.e., to be able to list other variables besides the ones that match.  That, to me, would be very useful in determining how similar the so-called duplicates actually are.  However, -duplicates list- will only list the variables that are used to determine duplication.
>
> Perhaps Nick's answer to this question will deal with my original question as well.
>
> Regards,
> Joe
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Sergiy Radyakin
> Sent: Tuesday, February 25, 2014 12:41 PM
> To: [email protected]
> Subject: Re: st: RE: Fwd: identifying duplicate entry errors
>
> Joe, programming this is perhaps the easiest part. But, how do you imagine the output of this command?
>
> I hope you don't want to list the values.
>
> For example, take the case where you have 1 million observations, with roughly 20% duplicates on 10 variables and differences in any of the other 30 variables, with multiple duplications (not just 2, but say 22 with one id).
>
> diff of first vs second
> diff of first vs third
> ...
> diff of first vs twenty-second
> diff of second vs third
> ...
> diff of twenty-first vs twenty-second?
>
> How big would the output be for the case of N obs, with VAR1=const, and VAR2=_n? I am thinking factorial(N-1)?
>
> If you only want to list the vars that are different, then it might not hold for all the duplication groups.
>
> id age income
> 1 33 1
> 1 33 2
> 1 32 3
> 2 45 70
> 2 46 71
> 2 47 77
>
> dups id, diff()
>
> Should the output be: income? (all obs are different in income), or
> age+income? (some observations also differ by age). Now imagine you
> have a 1 million observation dataset.
>
> Best, Sergiy Radyakin
>
>
>
>
>
> On Tue, Feb 25, 2014 at 12:15 PM, Joe Canner <[email protected]> wrote:
>> Alison,
>>
>> Have you tried -duplicates list-?  This is probably not as helpful as you would like, but it's a start.  I have had similar wishes for the -duplicates- command.  If there are no ideas forthcoming in response to your question, perhaps it is time to write an enhancement of the -duplicates- command.
>>
>> Regards,
>> Joe Canner
>> Johns Hopkins University School of Medicine
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Alison El
>> Ayadi
>> Sent: Tuesday, February 25, 2014 11:07 AM
>> To: [email protected]
>> Subject: st: Fwd: identifying duplicate entry errors
>>
>> Dear Statalisters,
>>
>> I am working to identify duplicates within a very messy dataset. Among observations that have the same values on a set of variables (that I define), I would love to be able to identify the variables on which their values differ (that is, how they are not true duplicates).
>>
>> Does anyone have any ideas about this?
>> Thanks so much,
>> Alison
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/

