Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Nick Cox" <n.j.cox@durham.ac.uk> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | RE: st: extension of countmatch |
Date | Tue, 20 Apr 2010 18:44:37 +0100 |
The original question was about -countmatch-, which was not explained. I checked with -findit- and discovered that I am the author. This rather underscores the advice about explaining where user-written programs come from. -countmatch- is really some way away from the problem here, so it's doubly fortunate that Robert has good code ready and waiting. Nick n.j.cox@durham.ac.uk Robert Picard This looks like a problem I have had to handle before. I created a program to group identifiers when values match for specified variables. You can get it by typing, in Stata: net from http://robertpicard.com/stata I have prepared an example, starting from your example data and I have added a few extra lines that show other id combinations. The last id "aa9" is the same as "aa7" because each share a previous name ("bb4"). To get this to work, an initial newid variable is created to uniquely identify each observation. Each observation is duplicated and the variable an_id contains all name variations for each value of newid. Then -group_id- does its magic and groups all your initial ids together. Of the 4 extra observations that I added, those that share "bb4" have been grouped together. Hope this helps, *--------------------------- begin example ----------------------- version 11 clear input str32(final_id id1 id2 id3) aaa aa1 aa2 aa3 aa3 bb1 bb2 bb1 ll1 aa4 aa5 aa6 aa7 bb3 bb4 aa8 bb5 bb6 bb7 aa9 a10 bb4 end list, noobs sep(0) // create a new identifier gen newid = _n rename final_id an_id tempfile f qui save "`f'" // create duplicate observations for each newid foreach v in id1 id2 id3 { keep newid `v' rename `v' an_id keep if an_id != "" append using "`f'" qui save "`f'", replace } sort newid an_id list , sepby(newid) // create a final merged_id, starting from newid gen merged_id = newid // type net from http://robertpicard.com/stata to get -group_id- group_id merged_id, matchby(an_id) sort merged_id newid an_id list, sepby(merged_id) *--------------------- end example -------------------------- On Tue, Apr 20, 2010 at 12:08 PM, Dalhia <ggs_da@yahoo.com> wrote: > I need to do a particular data manipulation to reconcile multiple ids created over time. There are multiple rows (15,345 rows which require reconciliation) so I will be really grateful if this can be somehow automated. > > here is how the data looks > > final_id, id1, id2, id3 > aaa, aa1, aa2, aa3 > aa3, bb1, bb2 > bb1, ll1 > > In this example, all the ids are actually referring to the same entity since aa3 is actually also bb1 and bb2, and bb1 is also ll1. Here is how I am trying to get the data to look so I know that they all actually are the same entity: > final_id, id1, id2, id3, id4, id5, id6 > aaa, aa1, aa2, aa3, bb1, bb2, ll1 > > I was playing with somehow extending countmatch (which tells me when the same cell appears in other rows in other variables) so that it can identify these duplicates in other variables, and then also pull them out. But so far no luck. I am horrible at figuring out code. Any help will be appreciated. * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/