The Stata listserver

RE: st: Matching Strings


From   cschecht@aecom.yu.edu
To   statalist@hsphsun2.harvard.edu
Subject   RE: st: Matching Strings
Date   Sun, 24 Nov 2002 14:06:48 -0500

Pedro Martins wants to match up names from two different files and do a loose
match because of misspellings.  Soundex codes were devised for just this
purpose by Robert Russell and Margaret Odell; Donald Knuth gives a well-known
description in The Art of Computer Programming, and a description of the
algorithm can also be found in the statalist archives.  In Stata code, you want
something like this:

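* Build soundex keys for file1.  Rename the names so that both spellings
* survive the join (joinby keeps the master copy of same-named variables).
use firstname lastname using file1, clear
egen sfn = soundex(firstname)
egen sln = soundex(lastname)
rename firstname firstname1
rename lastname lastname1
sort sln sfn lastname1 firstname1
tempfile temp1
save `temp1'
* Same keys for file2, then form every pair that shares both codes.
use firstname lastname using file2, clear
egen sfn = soundex(firstname)
egen sln = soundex(lastname)
sort sln sfn lastname firstname
joinby sln sfn using `temp1'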


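Depending on the relationships between the records in file1 and file2, you will
want to choose an appropriate setting of the -unmatched()- option of -joinby-
(none, master, using, or both) to handle unmatched records.  By default,
unmatched records from both files are dropped.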
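To make the handling of unmatched records concrete, here is a minimal sketch on two tiny made-up datasets.  The names are invented for illustration, and the sketch assumes a Stata version that provides the built-in soundex() string function, used here in place of the user-written egen function so that it runs without any downloads.

```stata
* Hypothetical stand-in for file1
clear
input str10 firstname str10 lastname
"Jon" "Smith"
"Ann" "Meyer"
end
generate sfn = soundex(firstname)
generate sln = soundex(lastname)
* Rename so these spellings are kept alongside the other file's
rename firstname firstname1
rename lastname lastname1
tempfile temp1
save `temp1'

* Hypothetical stand-in for file2
clear
input str10 firstname str10 lastname
"John"  "Smyth"
"Anne"  "Mayer"
"Pedro" "Martins"
end
generate sfn = soundex(firstname)
generate sln = soundex(lastname)

* Keep unmatched records from both sides; _merge records the source
joinby sln sfn using `temp1', unmatched(both)
tabulate _merge
list firstname lastname firstname1 lastname1 _merge
```

With unmatched(both), records found in only one file are kept with missing values for the other file's variables, and _merge distinguishes them from true pairs, which makes the hand checking easier to organize.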

The resulting file will pair names with all other names that are reasonable
spelling variants.  Soundex is not part of official Stata; it is a user-written
egen function which you can locate and download using -findit-.  If you do this
kind of work with any frequency, soundex is an indispensable tool.  It isn't
perfect: you will still have to do some hand cleaning and searching--but usually
it will get you almost all the way there.
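A quick way to see both the strength and the limits of the coding is to look at a few codes directly.  This is only a sketch, again assuming the built-in soundex() string function of more recent Stata releases; the example names are chosen for illustration.

```stata
* Spelling variants that share a code, so they would be paired
display soundex("Martins")
display soundex("Martinz")
display soundex("Martens")

* Different names that also share a code (a false match to weed out by hand)
display soundex("Robert")
display soundex("Rupert")

* Variants that differ in the first letter get different codes
* (a match that soundex will miss)
display soundex("Catherine")
display soundex("Katherine")
```

Because the code always keeps the first letter of the name, variants such as Catherine and Katherine are never paired, while unrelated names with similar consonants can be.  Both cases are the sort of thing the hand cleaning has to catch.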

By the way, my experience with soundex is that while it works beautifully for
matching up names, it doesn't perform as well for matching other types of
vocabulary.  

Good luck.

Clyde Schechter
Dept. of Family Medicine & Community Health
Albert Einstein College of Medicine
Bronx, New York, USA

