Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: fuzzy match two data sets with strgroup
From
Koleman Strumpf <[email protected]>
To
<[email protected]>
Subject
st: fuzzy match two data sets with strgroup
Date
Fri, 29 Nov 2013 11:48:59 -0600
Hi Statalist:
I have two data sets which I would like to match based on a variable
(Match_Var). The Match_Var is slightliy different in the two files due
to treatment of non-standard characters, truncations of the string, and
some other small changes. But I want to pair the two files up as best as
I can.
I would like to use strgroup for this purpose. The main difficulty I am
encountering is that the main output of this (the group variable) will
match strings regardless of the dataset. For example, the group variable
may contain two observations from the same data set while I am only
interested in matching observations from different data sets. That is I
do something like this,
. use unmatched1.dta, clear
. append using unmatched2.dta
. strgroup Match_Var if _merge!=3, gen(group) threshold(0.02) force
where _merge=1 in unmatched1.dta and _merge=2 in unmatched2.dta. I will
have groups like,
Match_Var _merge group
somestuff123.txt 1 359
somestuff124.txt 1 359
Since I only am interested in matching observations from different files
this is not a useful pairing (that is, _merge is the same for both of
these observations).
I have not been able to avoid these cases. Some things I have tried:
- decrease threshold --> but then I end up missing most of the pairs I want
- increase thresholds --> work better, and include a mix of observations
from the two observations; but then I am not sure I have a one-for-one
pairing between the two files
- sort based on Match_Var after running strgroup (sort group Match_Var
_merge) --> this does not work in my case since it will typically group
all of the observations from one data file first and then the ones from
the other file)
Are there any suggestions on dealing with this? I would imagine this is
a pretty standard use of the fuzzy match so perhaps someone has
encountered this problem and has a solution.
K
PS I know there is also the reclink command which seems to do what I
want. However as has been documented elsewhere in the statalist this
command is fragile and often crashes before giving any results, e.g.
. reclink Match_Var using file1.dta, gen(myscore) idm(id_1) idu(id_2)
0 perfect matches found
Going through 54513 observation to assess fuzzy matches, each .=5%
complete
option KING not allowed
r(198);
I have tried removing as many characters as I could to avoid this but I
can never get the command to make any progress so I have given up on it.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/