Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Re: matching misspelled names


From   Laura Giuliano <giuliano@econ.Berkeley.EDU>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Re: matching misspelled names
Date   Fri, 23 Aug 2002 16:52:18 -0700

You could try using the Levenshtein or “edit distance” measure of the
similarity between two strings.  (This is just equal to the number of
deletions, insertions, or substitutions required to transform string 1 into
string 2.  For example, the edit distance between "cat" and "car" is 1
(because one substitution, from “t” to “r” is required); the distance from
"cat" to "carton" is 3.)

I’ve written a program that calculates this distance.  Maybe you could use
this to write a program that loops through the names in the second list for
each name in the first list, calculates the edit distance, and creates an
output file with close matches.

Here's my program:

/********************************************
This program takes two words and return the edit distance between them.
********************************************/
capture program drop lev
program define lev
	args word1 word2
	
	local L1 = length("`word1'")
	local L2 = length("`word2'")

	matrix A=J(`L2'+1, `L1'+1, 0)					/* Create matrix of zero's */	

	forval i = 0 / `L1'			{				/* Initialize first row */
		matrix A[1,`i'+1] = `i'		
	}

	forval j = 1 / `L2' 		{					/* Initialize first column */
		matrix A[`j'+1,1] = `j'		
	}

	forval j = 1 / `L2'	{						/* Loop through characters of word 2 */
		forval i = 1 / `L1' 	{					/* Loop through characters of word 1 */

			if  substr("`word2'", `j', 1) == substr("`word1'", `i', 1) {
				local cost=0
			}							/*  Set cost=0 if there's a match; */
			else {local cost=1}					/*          cost=1 otherwise	*/
	
			local m = 1 + A[`j', `i'+1]			/* Set current cell equal to min of:	*/
			local n = 1 + A[`j'+1, `i']			/*  cell directly above (m) + 1		*/
			local d = `cost' + A[`j', `i']			/*  cell directly to left (n) + 1		*/
			matrix A[`j'+1,`i'+1]=min(`m',`n',`d')		/*  cell diagonal above & left
(d) + cost  */
		}
	}
	
	local lev = A[`L2'+1, `L1'+1]					/* Grab the value from bottom, right cell */
								
	di "Levenshtein distance between `word1' and `word2' is `lev' "

end

At 06:44 PM 8/23/02 -0400, you wrote:
>> I have a  dataset, one of whose variables contains names of drugs.  Many
>of
>> the entries are misspelled or truncated.  I have an index file with a
>> reasonably complete list of commercial and generic drug names.  After
>> merging the files and identifying exact matches, I would like to try to
>> match the remaining, presumably misspelled, drug names with a
>corresponding
>> correct name from the index.  When the names are of people, the soundex
>> algorithm usually provides a reasonably short list of candidate matches.
>> But trying it with these drug names, many of the misspellings match with
>> several dozen candidates, making the resulting list of names and candidate
>> matches for manual review and selection unworkably long.
>>
>> Does anybody out there know of an alternative to soundex coding that might
>> work better in this peculiar vocabulary?  Or of another approach to this
>> problem?
>>
>> Thanks in advance for any help.
>
>I often run into similar problems matching text data from administrative
>databases.  Soundex, by itself, was really designed as a lookup approach to
>get "close" to a match so a data entry person can then manually locate the
>correct match quicker, or as a way to check names after matching on some
>other identifer.  I have an ado file that calculates the longest overlapping
>substring between two strings.  You could match on soundex (or perhaps
>extend the soundex to more than the default 4 characters to get fewer
>matches) and then calculate the maximum substring overlap on the candidate
>matches and then select the match with the most overlap.  This approach is
>certainly not the most elegant, but can automate a lot of the task.  You may
>also want to check out other string matching algorithms (I think Jaro is
>one, check the census bureau's web site for some papers on this topic).
>
>If you come up with a good approach, please post it back to the list...
>
>Michael Blasnik
>
>
>
>*
>*   For searches and help try:
>*   http://www.stata.com/support/faqs/res/findit.html
>*   http://www.stata.com/support/statalist/faq
>*   http://www.ats.ucla.edu/stat/stata/

- Laura Giuliano

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index