Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: extension of countmatch


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: extension of countmatch
Date   Tue, 20 Apr 2010 18:44:37 +0100

The original question was about -countmatch-, which was not explained. I
checked with -findit- and discovered that I am the author. This rather
underscores the advice about explaining where user-written programs come
from. 

-countmatch- is really some way away from the problem here, so it's
doubly fortunate that Robert has good code ready and waiting. 

Nick 
n.j.cox@durham.ac.uk 

Robert Picard

This looks like a problem I have had to handle before. I created a
program to group identifiers when values match for specified
variables. You can get it by typing, in Stata:

net from http://robertpicard.com/stata

I have prepared an example, starting from your example data and I have
added a few extra lines that show other id combinations. The last id
"aa9" is the same as "aa7" because each share a previous name ("bb4").

To get this to work, an initial newid variable is created to uniquely
identify each observation. Each observation is duplicated and the
variable an_id contains all name variations for each value of newid.

Then -group_id- does its magic and groups all your initial ids
together. Of the 4 extra observations that I added, those that share
"bb4" have been grouped together.

Hope this helps,

*--------------------------- begin example -----------------------

version 11

clear
input str32(final_id id1 id2 id3)
aaa aa1 aa2 aa3
aa3 bb1 bb2
bb1 ll1
aa4 aa5 aa6
aa7 bb3 bb4
aa8 bb5 bb6 bb7
aa9 a10 bb4
end

list, noobs sep(0)

// create a new identifier
gen newid = _n
rename final_id an_id
tempfile f
qui save "`f'"

// create duplicate observations for each newid
foreach v in id1 id2 id3 {
	keep newid `v'
	rename `v' an_id
	keep if an_id != ""
	append using "`f'"
	qui save "`f'", replace
}
sort newid an_id
list , sepby(newid)

// create a final merged_id, starting from newid
gen merged_id = newid

// type net from http://robertpicard.com/stata to get -group_id-
group_id merged_id, matchby(an_id)

sort merged_id newid an_id
list, sepby(merged_id)

*--------------------- end example --------------------------


On Tue, Apr 20, 2010 at 12:08 PM, Dalhia <ggs_da@yahoo.com> wrote:

> I need to do a particular data manipulation to reconcile multiple ids
created over time. There are multiple rows (15,345 rows which require
reconciliation) so I will be really grateful if this can be somehow
automated.
>
> here is how the data looks
>
> final_id, id1, id2, id3
> aaa, aa1, aa2, aa3
> aa3, bb1, bb2
> bb1, ll1
>
> In this example, all the ids are actually referring to the same entity
since aa3 is actually also bb1 and bb2, and bb1 is also ll1. Here is how
I am trying to get the data to look so I know that they all actually are
the same entity:
> final_id, id1, id2, id3, id4, id5, id6
> aaa, aa1, aa2, aa3, bb1, bb2, ll1
>
> I was playing with somehow extending countmatch (which tells me when
the same cell appears in other rows in other variables) so that it can
identify these duplicates in other variables, and then also pull them
out. But so far no luck. I am horrible at figuring out code. Any help
will be appreciated.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index