Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: fuzzy match two data sets with strgroup


From   Koleman Strumpf <[email protected]>
To   <[email protected]>
Subject   st: fuzzy match two data sets with strgroup
Date   Fri, 29 Nov 2013 11:48:59 -0600

Hi Statalist:

I have two data sets which I would like to match based on a variable (Match_Var). The Match_Var is slightliy different in the two files due to treatment of non-standard characters, truncations of the string, and some other small changes. But I want to pair the two files up as best as I can.

I would like to use strgroup for this purpose. The main difficulty I am encountering is that the main output of this (the group variable) will match strings regardless of the dataset. For example, the group variable may contain two observations from the same data set while I am only interested in matching observations from different data sets. That is I do something like this,
     . use unmatched1.dta, clear
     . append using unmatched2.dta
     . strgroup Match_Var if _merge!=3, gen(group) threshold(0.02) force
where _merge=1 in unmatched1.dta and _merge=2 in unmatched2.dta. I will have groups like,
     Match_Var                    _merge        group
     somestuff123.txt        1                    359
     somestuff124.txt        1                    359
Since I only am interested in matching observations from different files this is not a useful pairing (that is, _merge is the same for both of these observations).

I have not been able to avoid these cases. Some things I have tried:
- decrease threshold --> but then I end up missing most of the pairs I want
- increase thresholds --> work better, and include a mix of observations from the two observations; but then I am not sure I have a one-for-one pairing between the two files - sort based on Match_Var after running strgroup (sort group Match_Var _merge) --> this does not work in my case since it will typically group all of the observations from one data file first and then the ones from the other file)

Are there any suggestions on dealing with this? I would imagine this is a pretty standard use of the fuzzy match so perhaps someone has encountered this problem and has a solution.

K


PS I know there is also the reclink command which seems to do what I want. However as has been documented elsewhere in the statalist this command is fragile and often crashes before giving any results, e.g.
     . reclink Match_Var using file1.dta, gen(myscore) idm(id_1) idu(id_2)
     0 perfect matches found
Going through 54513 observation to assess fuzzy matches, each .=5% complete
     option KING not allowed
     r(198);
I have tried removing as many characters as I could to avoid this but I can never get the command to make any progress so I have given up on it.


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index