Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Joe Canner <jcanner1@jhmi.edu> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | st: RE: RE: 'Fuzzy' text match |
Date | Sun, 23 Mar 2014 21:36:30 +0000 |
P.S. If you haven't already, check out -reclink-, -vmatch-, and -nearmrg-, all available from SSC. I don't know how they handle this problem, but they might be worth a look.

________________________________________
From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Joe Canner [jcanner1@jhmi.edu]
Sent: Sunday, March 23, 2014 5:30 PM
To: statalist@hsphsun2.harvard.edu
Subject: st: RE: 'Fuzzy' text match

Robert,

Do all comparisons between the two data sets follow the same pattern, e.g., the name in one file is exactly contained within the name in the other file? If so, you can use the -strpos()- function. This will still be challenging to do as a -merge-, but if you come back with a positive answer to the above question, I (or someone else here) can suggest some code that might work in this situation. It would probably involve using the shorter file as a look-up table for the longer file.

Regards,

Joe Canner
Johns Hopkins University School of Medicine

________________________________________
From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Robert Davidson [rhd773@gmail.com]
Sent: Sunday, March 23, 2014 5:15 PM
To: statalist@hsphsun2.harvard.edu
Subject: st: 'Fuzzy' text match

Dear Statalist,

I am trying to do a text match across two files in Stata 13 in which the names I want to match will not be the same in the two files. I have looked into options here and tried a few, including strgroup, but these do not work for the following reason: in one file I have company name e.g. Ford Motor Company, and in the other file I have facility name e.g. Warren Engine Plant Ford Motor Company. strgroup does not consider these two strings as even remotely close (Levenshtein distance is 22 here) and treats words that have nothing in common as being much closer.
Is there a way to measure how much of one string appears in another, so that cases like the above example might be considered reasonably close? To use strgroup with a threshold that would include a match like the one above, I would wind up with about 98% false matches. Also, my two datasets have about 1,000 and 1,000,000 observations, so doing anything manually is quite cumbersome.

Thank you,

Robert Davidson

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
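
For readers following the thread, the two ideas raised above (using the shorter file as a look-up table with a substring test, and measuring how much of one string appears in another) can be sketched roughly as follows. This is only an illustration of the logic, written in Python rather than Stata; it is not code posted by anyone in the thread. The function names, the sample names other than the Ford example, and the case-insensitive comparison are all assumptions made for the sketch. Note that Stata's -strpos()- is case-sensitive, so a Stata version would need to normalize case separately.

```python
def find_company(facility, companies):
    """Return the longest company name found inside the facility name, or None.

    The `in` test plays the role of a strpos()-style substring check.
    Comparison is case-insensitive (an assumption of this sketch).
    """
    haystack = facility.casefold()
    hits = [c for c in companies if c.casefold() in haystack]
    return max(hits, key=len) if hits else None


def containment(short, long):
    """Share of the words in `short` that also occur in `long` (0.0 to 1.0)."""
    short_words = short.casefold().split()
    long_words = set(long.casefold().split())
    if not short_words:
        return 0.0
    return sum(w in long_words for w in short_words) / len(short_words)


# Hypothetical data: the short file acts as the look-up table.
companies = ["Ford Motor Company", "General Motors"]
facilities = [
    "Warren Engine Plant Ford Motor Company",
    "Acme Widget Works",
]

for f in facilities:
    # Exact containment first; None means no company name is a substring.
    print(f, "->", find_company(f, companies))
```

The substring test only helps if the company name appears verbatim inside the facility name. The word-based score is a looser fallback for near misses such as an abbreviated "Co", at the cost of more false matches on common words. With the sizes mentioned in the thread, checking every facility against every company amounts to roughly 1,000 x 1,000,000 substring tests, so the look-up should loop over the short list rather than form all pairs in memory.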