Statalist



RE: st: reshape and duplicates


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: reshape and duplicates
Date   Wed, 2 Apr 2008 18:34:53 +0100

In addition, consider the following trick: 

gen first = cond(name1 < name2, name1, name2) 
gen second = cond(name1 < name2, name2, name1) 
duplicates <whatever> first second  

That doesn't fix any problems with spelling in a wide sense (case, or
leading, trailing, or embedded spaces), but it does address the detail
that A and B is the same pair as B and A.
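
As a minimal sketch of the trick on a toy dataset: the variable names n1, n2, first and second are illustrative only, the cleaning step (lower case, trimmed and compressed spaces) is an assumed rule for the spelling issues mentioned above, and -duplicates drop- is just one possible choice for <whatever>.

```stata
* Sketch only: toy data; n1, n2, first, second are illustrative names
clear
input str15 name1 str15 name2
"Smith, John"  "Jones, Abby"
"Jones, Abby"  "Smith, John"
"Smith, John"  "White, Rich"
"white,  rich" "Smith, John"
end
* Assumed cleaning rule: ignore case and extra spaces when comparing
generate n1 = lower(trim(itrim(name1)))
generate n2 = lower(trim(itrim(name2)))
* Put the alphabetically smaller name first, so A-B and B-A coincide
generate first  = cond(n1 < n2, n1, n2)
generate second = cond(n1 < n2, n2, n1)
* Keep one observation per unordered pair
duplicates drop first second, force
list name1 name2, noobs
```

Here -duplicates drop- with the force option keeps the first observation in each group of identical first and second values and deletes the rest, so it is worth checking with -duplicates report first second- before dropping anything.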

Nick
n.j.cox@durham.ac.uk 

Joseph Coveney

Jennifer Nicoll Victor wrote:

Thank you Nick, for recommending the reshape command to me last week.
I now have converted my UCINET relational dataset into dyads in Stata.
However, I now have the problem of duplicate observations.  My data are
non-directional so the pair A-B is the same as the pair B-A.  I need to
efficiently delete the duplicates.  I need only the unique observations,
where the unit of analysis is a pair.  Can someone help?

Essentially, I have...
ID1  ID2        name1   name2 ...
1       2       Smith, John     Jones, Abby
1       3       Smith, John     White, Rich
1       4       Smith, John     Black, Kelly
2       1       Jones, Abby     Smith, John
2       3       Jones, Abby     White, Rich
2       4       Jones, Abby     Black, Kelly
3       1       White, Rich     Smith, John
3       2       White, Rich     Jones, Abby
3       4       White, Rich     Black, Kelly
4       1       Black, Kelly    Smith, John
4       2       Black, Kelly    Jones, Abby
4       3       Black, Kelly    White, Rich

And I need to have....
ID1  ID2        name1   name2 ...
1       2       Smith, John     Jones, Abby
1       3       Smith, John     White, Rich
1       4       Smith, John     Black, Kelly
2       3       Jones, Abby     White, Rich
2       4       Jones, Abby     Black, Kelly
3       4       White, Rich     Black, Kelly

But I have 191,406 pairs.

--------------------------------------------------------------------------------

The do-file below gets what you want.  Sorting 200 000 observations took
1.01 seconds on my laptop, so if the approach below takes a few moments
on your dataset, then it's probably to do with the -min()- and -max()-.
You also might be able to avoid the situation by doing something
pre-emptively upstream.

Joseph Coveney

clear *
set more off
input byte ID1 byte ID2 str10 name1 str1 comma1 str10 name2 str10 name3 str1 comma2 str10 name4
1       2       Smith, John     Jones, Abby
1       3       Smith, John     White, Rich
1       4       Smith, John     Black, Kelly
2       1       Jones, Abby     Smith, John
2       3       Jones, Abby     White, Rich
2       4       Jones, Abby     Black, Kelly
3       1       White, Rich     Smith, John
3       2       White, Rich     Jones, Abby
3       4       White, Rich     Black, Kelly
4       1       Black, Kelly    Smith, John
4       2       Black, Kelly    Jones, Abby
4       3       Black, Kelly    White, Rich
end
replace name1 = name1 + ", " + name2
replace name2 = name3 + ", " + name4
keep ID* name1 name2
format name* %-`=max(length(name1), length(name2))'s
*
* Begin here
*
generate str dyad_id = string(min(ID1, ID2)) + "-" + string(max(ID1, ID2))
bysort dyad_id: keep if _n == 1
list, noobs separator(0)
exit
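
A variation on the same idea, offered as a sketch rather than as part of the do-file above: the smaller and larger ID can be kept in two numeric variables instead of one string key. The names lo and hi are illustrative only. Sorting on ID1 within each pair means the observation kept is the one with ID1 < ID2, which matches the orientation in the desired listing; without that, -bysort- may keep either member of a pair.

```stata
* Sketch only: toy data; lo and hi are illustrative variable names
clear
input byte ID1 byte ID2
1 2
1 3
2 1
3 1
2 3
3 2
end
* Smaller and larger ID of each pair identify the unordered dyad
generate byte lo = min(ID1, ID2)
generate byte hi = max(ID1, ID2)
* Within each dyad, sort on ID1 and keep the row with ID1 < ID2
bysort lo hi (ID1): keep if _n == 1
list ID1 ID2, noobs separator(0)
```

If every pair is known to appear in both orders, as in the example data, -keep if ID1 < ID2- would give the same result in one line, but that relies on an assumption about the data that should be checked first.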



