Re: st: RE: Fwd: identifying duplicate entry errors

From	Nick Cox <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: RE: Fwd: identifying duplicate entry errors
Date	Tue, 25 Feb 2014 17:56:31 +0000

The aim of -duplicates- is, hmm, to identify duplicates. But it is not
the only tool to identify duplicates. Let's suppose first that you
want to identify duplicates based on -a b c-. Then

egen group = group(a b c), label

groups observations identical on -a b c-.
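As a rough illustration on made-up data (the values below are invented; only the variable names -a b c- come from the question), groups holding more than one observation are the candidate duplicates. The variables -n- and -dup- are names chosen here for the sketch, and -duplicates tag- is included only as a cross-check.

```stata
* Sketch only: toy data with invented values
clear
input a b c
1 1 1
1 1 1
2 2 2
3 3 3
3 3 3
3 3 3
end

* Number the distinct combinations of a b c
egen group = group(a b c), label

* n = number of observations sharing the same a b c
bysort group : gen n = _N

* Cross-check: -duplicates tag- stores the number of surplus copies
duplicates tag a b c, generate(dup)

list a b c group n dup, sepby(group) noobs
```

Here -dup- should equal -n - 1- in every observation, so -n > 1- and -dup > 0- pick out the same observations.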

su group

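tells you how many groups that means: -egen, group()- numbers the
groups 1, 2, and so on, so the maximum of -group- is the number of
groups. Now suppose you have further interest in -d e f-. Consider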

bysort group (d) :

If you sort on -d- within distinct groups of -group- then any
different values are shaken apart: within a group, -d[1]- and -d[_N]-
differ if and only if -d- is not constant. (In fact, you don't need to
create -group- to do this: -bysort a b c (d) :- works the same way.)
Let's take this further to identify what varies within distinct
values of -group-.

gen whatvaries = ""

foreach v in d e f {
      bysort group (`v') : replace whatvaries = ///
      whatvaries + cond(`v'[_N] != `v'[1], "`v' ", "")
}


The analysis of -whatvaries- might not be easy, but it's what you seem
to be asking for.
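One way to make that analysis easier, sketched here on made-up data, is to count the names stored in -whatvaries-, which gives the number of variables that differ within each group. The data values and the names -nvaries- and -tag- are assumptions for illustration only. Note that -///- continues a line in a do-file, not in the Command window.

```stata
* Sketch only: toy data with invented values
clear
input a b c d e f
1 1 1 10 20 30
1 1 1 10 21 30
2 2 2  5  6  7
2 2 2  5  6  7
3 3 3  1  2  3
3 3 3  9  2  4
end

egen group = group(a b c), label

* Collect the names of variables that are not constant within group
gen whatvaries = ""
foreach v in d e f {
    bysort group (`v') : replace whatvaries = ///
        whatvaries + cond(`v'[_N] != `v'[1], "`v' ", "")
}

* Number of variables that differ within each group
gen nvaries = wordcount(whatvaries)

* Show one line per group
egen tag = tag(group)
list group whatvaries nvaries if tag, noobs
```

With these data, the first group differs on -e- only, the second is a set of true duplicates (-nvaries- is 0), and the third differs on -d- and -f-.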

See also

How do I list observations in a group that differ on a variable?
http://www.stata.com/support/faqs/data-management/listing-observations-in-group/

Nick
[email protected]


On 25 February 2014 17:40, Alison El Ayadi <[email protected]> wrote:
> Thanks for your suggestion.  I have run through a number of different
> combinations of listing the duplicates, but I suspect that there are
> duplicates that I can identify when limiting to certain variables but
> that I do not obtain when including all variables due to data entry
> errors.  That's why it would be so great to have something that says
> these are duplicate groups based on var x, var y, and var z, and here is
> the total number of variables that differ between the duplicate
> observations.
>
> Best,
> Alison
>
> On Tue, Feb 25, 2014 at 9:15 AM, Joe Canner <[email protected]> wrote:
>> Alison,
>>
>> Have you tried -duplicates list-?  This is probably not as helpful as you would like, but it's a start.  I have had similar wishes for the -duplicates- command.  If there are no ideas forthcoming in response to your question, perhaps it is time to write an enhancement of the -duplicates- command.
>>
>> Regards,
>> Joe Canner
>> Johns Hopkins University School of Medicine
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Alison El Ayadi
>> Sent: Tuesday, February 25, 2014 11:07 AM
>> To: [email protected]
>> Subject: st: Fwd: identifying duplicate entry errors
>>
>> Dear Statalisters,
>>
>> I am working to identify duplicates within a very messy dataset and would love to be able to identify among those observations which have the same values for a set of variables (that I define) what are the variables where their values differ (how are they not true duplicates).
