Re: st: Re: Finding "near"-matches
This sounds like a job for something similar to the Soundex algorithm
described by Donald Knuth. You can find out more about this at
I hope this helps.
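
For readers who want to see the idea concretely, here is a minimal sketch of one common American Soundex variant in Python (not Stata, and not code from this thread). The function name `soundex` and the handling of H and W are assumptions of this sketch; published variants differ in small details.

```python
def soundex(name: str) -> str:
    """Return a 4-character Soundex code (one common American variant)."""
    # Map consonants to digit classes; vowels, H, W and Y carry no digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # its digit still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev to ""
    return (result + "000")[:4]         # pad or truncate to 4 characters


if __name__ == "__main__":
    for n in ("Ford", "Forde", "Robert", "Rupert"):
        print(n, soundex(n))
```

Names that receive the same code (here "Ford" and "Forde") become candidate matches to check by hand.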
At 16:49 28/10/2005, you wrote:
The topic gets more and more interesting. I often need to match
'fuzzily' the names from two databases that have very minor
differences. Here are some examples:
Ford Inc. (just an example)
XYZ Technology Inc.
Can you recommend some programs to generate a list of 'fuzzy' or
'near' matches for a name (one or more alphanumeric characters)?
Even if a program provides only three possible matches for
the name 'Ford', that's still better than hand-checking.
On 10/28/05, Seb Buechte <email@example.com> wrote:
> Clyde and Michael,
> I also programmed something to find out how similar two strings are
> using the edit-distance method. The edit distance between two strings
> is the number of changes required to turn one string into the
> other. I admit that what I programmed is somewhat
> "quick & dirty" code. If you would like, I can email it to you, but if
> you would like to know how it works you could check out this website:
> There you find a description of the underlying algorithm.
> Kind regards,
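
As an illustration of the edit-distance method described above, here is a minimal Python sketch of the standard Levenshtein distance, which counts single-character insertions, deletions, and substitutions. It is not Seb's code; the function name `edit_distance` and the choice of the Levenshtein definition are assumptions of this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("Ford Inc.", "Ford Inc"))
```

A small distance relative to the string length suggests a near match; the cutoff is a judgment call that depends on the data.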
> On 10/27/05, Michael Blasnik <firstname.lastname@example.org> wrote:
> > "Clyde Schechter" <email@example.com> wrote about trying to match not
> > quite identical text strings between datasets. I also spend a great
> > deal of time trying to match across administrative databases and have
> > developed a few tools to help. There is a fair amount of literature on
> > string comparators (e.g., US Census web site) that produce some rating
> > of the similarity of two text strings. I have coded up a couple of them
> > and tend to use the bigram (which counts the proportion of 2-character
> > substrings that exist in both strings). I have also automated some of the
> > problems (e.g., l vs. 1, 0 vs. O) for specific projects where I simply
> > create a new version of each of the strings that replaces all occurrences
> > of l and O with 1 and 0 (and other common errors) before running the
> > string comparison.
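
The bigram comparison and the character clean-up described above can be sketched in Python as follows. This is not Michael's ado file: the function names, the exact scoring formula (a Dice coefficient over the sets of 2-character substrings), and the two-character substitution table are all assumptions made for illustration.

```python
# Replace characters that are easily confused in data entry (assumed table).
CONFUSABLE = str.maketrans("lO", "10")


def normalize(s: str) -> str:
    """Map lowercase l to 1 and uppercase O to 0 before comparing."""
    return s.translate(CONFUSABLE)


def bigrams(s: str) -> set:
    """Set of all 2-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Share of bigrams common to both strings, from 0.0 to 1.0."""
    na, nb = normalize(a), normalize(b)
    ga, gb = bigrams(na), bigrams(nb)
    if not ga and not gb:
        # Strings shorter than 2 characters have no bigrams to compare.
        return 1.0 if na == nb else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


if __name__ == "__main__":
    print(bigram_similarity("XYZ Technology Inc.", "XYZ Technologies Inc"))
```

Scores near 1.0 indicate near-identical strings; sorting candidate pairs by score gives a short list to review by hand.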
> > If there is interest, I can email the bigram ado file or potentially
> > post it on SSC when I get around to writing up the help.
> > Michael Blasnik
> > firstname.lastname@example.org .
> > *
> > * For searches and help try:
> > * http://www.stata.com/support/faqs/res/findit.html
> > * http://www.stata.com/support/statalist/faq
> > * http://www.ats.ucla.edu/stat/stata/
Lecturer in Medical Statistics
Department of Public Health Sciences
Division of Asthma, Allergy and Lung Biology
King's College London
5th Floor, Capital House
42 Weston Street
London SE1 3QD
Tel: 020 7848 6648 International +44 20 7848 6648
Fax: 020 7848 6620 International +44 20 7848 6620
or 020 7848 6605 International +44 20 7848 6605
Opinions expressed are those of the author, not the institution.