The Stata listserver

st: Re: matching misspelled names


From   "Michael Blasnik" <mblasnik@verizon.net>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Re: matching misspelled names
Date   Fri, 23 Aug 2002 18:44:09 -0400

> I have a dataset, one of whose variables contains names of drugs.  Many of
> the entries are misspelled or truncated.  I have an index file with a
> reasonably complete list of commercial and generic drug names.  After
> merging the files and identifying exact matches, I would like to try to
> match the remaining, presumably misspelled, drug names with a corresponding
> correct name from the index.  When the names are of people, the soundex
> algorithm usually provides a reasonably short list of candidate matches.
> But trying it with these drug names, many of the misspellings match with
> several dozen candidates, making the resulting list of names and candidate
> matches for manual review and selection unworkably long.
>
> Does anybody out there know of an alternative to soundex coding that might
> work better in this peculiar vocabulary?  Or of another approach to this
> problem?
>
> Thanks in advance for any help.

I often run into similar problems matching text data from administrative
databases.  Soundex, by itself, was really designed as a lookup approach to
get "close" to a match so that a data entry person can then locate the
correct match manually more quickly, or as a way to check names after
matching on some other identifier.  I have an ado file that calculates the
longest overlapping substring between two strings.  You could match on
soundex (or perhaps extend the soundex code beyond the default 4 characters
to get fewer matches), calculate the maximum substring overlap for each
candidate match, and then select the candidate with the most overlap.  This
approach is certainly not the most elegant, but it can automate a lot of the
task.  You may also want to check out other string matching algorithms (I
think Jaro is one; check the Census Bureau's web site for some papers on
this topic).
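
The soundex-then-overlap idea can be sketched roughly as follows.  This is
Python rather than Stata, purely for illustration, and it is not the ado
file mentioned above: the function names (soundex, longest_common_substring,
best_match), the length argument, and the small sample drug list are all
assumptions made for the example.

```python
def soundex(name, length=4):
    """American Soundex code, padded or truncated to `length` characters.

    `length` is an illustrative parameter: a longer code separates
    names that share the same first few consonant classes.
    """
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters_only = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters_only:
        return ""

    out = letters_only[0]
    prev = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out += digit
        if ch not in "HW":  # H and W do not separate equal codes; vowels do
            prev = digit
    return (out + "0" * length)[:length]


def longest_common_substring(a, b):
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev_row = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                row[j] = prev_row[j - 1] + 1
                best = max(best, row[j])
        prev_row = row
    return best


def best_match(name, index_names, length=4):
    """Among index names with the same soundex code, pick the one with
    the most substring overlap; return None if no code matches."""
    key = soundex(name, length)
    candidates = [c for c in index_names if soundex(c, length) == key]
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: longest_common_substring(name.lower(), c.lower()))


# Hypothetical index and misspelling, for illustration only.
index_names = ["amoxicillin", "amikacin", "ampicillin", "atenolol"]
print(best_match("amoxicilin", index_names))
```

With the 4-character code, "amoxicilin" shares a soundex code with both
"amoxicillin" and "amikacin", and the overlap step picks between them; with
length=6 the two codes already differ, so the candidate list is shorter
before any overlap is computed.  Ties in overlap go to whichever candidate
comes first in the index, so a real run would still want a manual check of
close calls.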

If you come up with a good approach, please post it back to the list...

Michael Blasnik


