Re: st: Re: Finding "near"-matches


From   Roger Newson <roger.newson@kcl.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Re: Finding "near"-matches
Date   Fri, 28 Oct 2005 17:52:46 +0100

This sounds like a job for something similar to the Soundex algorithm, as described by Donald Knuth. You can find out more about this at

http://www.dcs.ed.ac.uk/home/stg/pub/S/soundex.html

or at

http://us2.php.net/soundex

or at

http://west-penwith.org.uk/misc/soundex.htm
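
For readers who want to see the idea in code, here is a minimal sketch of the standard American Soundex rules in Python. It is an illustration only, not code taken from any of the pages above; the function name soundex and the choice to ignore every non-letter character are assumptions made for this example.

```python
def soundex(name: str) -> str:
    """Return the 4-character American Soundex code of `name` (sketch)."""
    # Consonant groups that share a digit; vowels, Y, H and W get no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    # Assumption: keep only the letters A-Z, so spaces and dots are ignored.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same digit
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated digit is kept
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # both give R163
```

With this sketch, "Ford Co." and "Ford Corporation" both code to F632, but "Ford Inc." codes to F635, so Soundex on its own will not group every variant of a company name.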

I hope this helps.

Roger


At 16:49 28/10/2005, you wrote:

The topic gets more and more interesting. I often need to match
'fuzzily' the names from two databases that have very minor
differences. Here are some examples:

Ford Co.
Ford Corporation
Ford Inc. (just an example)

or

XYZ Tech
XYZ Technology Inc.

Can you recommend some programs to generate a list of 'fuzzy' or
'near' matches for a name (one or more alphanumeric characters)?
Even if a program provides only the three possible matches for the
name 'Ford', that's still better than hand-checking.
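
One possible first step for names like the examples above is to strip punctuation and common corporate suffixes before comparing. The Python sketch below is only an illustration under stated assumptions: the suffix list SUFFIXES and the helper name name_key are hypothetical choices for this example, not part of any program mentioned in this thread.

```python
import re

# Hypothetical suffix list; extend it to suit the data at hand.
SUFFIXES = {"co", "corp", "corporation", "inc", "incorporated", "ltd", "llc"}


def name_key(name: str) -> str:
    """Lower-case the name, drop punctuation and corporate suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    kept = [t for t in tokens if t not in SUFFIXES]
    # If every token was a suffix, fall back to the full token list.
    return " ".join(kept or tokens)


for raw in ["Ford Co.", "Ford Corporation", "Ford Inc.",
            "XYZ Tech", "XYZ Technology Inc."]:
    print(f"{raw!r:25} -> {name_key(raw)!r}")
```

The three Ford names all reduce to the key 'ford', whereas the XYZ names reduce to 'xyz tech' and 'xyz technology', so abbreviations still need a string comparator after this kind of clean-up.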

Aaron


On 10/28/05, Seb Buechte <sfbuechte@gmail.com> wrote:
> Clyde and Michael,
>
> I also programmed something to find out how similar two strings are
> using the edit-distance method. The edit distance between two strings
> is the minimum number of single-character changes (insertions,
> deletions, or substitutions) required to turn one string into the
> other. I admit that what I programmed is somewhat
> "quick&dirty" code. If you would like, I can email it to you, but if
> you would like to know how it works you could check out this website:
>
> http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/
>
> There you will find a description of the underlying algorithm.
>
> Kind regards,
> sebastian
>
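
The edit-distance idea described above can be sketched in a few lines of Python. Sebastian's own code is not shown in this thread, so this is not his implementation; it assumes the usual dynamic-programming recurrence in which an insertion, a deletion, and a substitution each cost 1, and the function name edit_distance is chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of single-character edits needed to turn `a` into `b`."""
    # prev[j] is the distance between the first i-1 characters of a
    # and the first j characters of b; only two rows are kept in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The comparison is case-sensitive as written; lower-casing both strings first is a common choice when matching names.
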
> On 10/27/05, Michael Blasnik <michael.blasnik@verizon.net> wrote:
> > "Clyde Schechter" <cschecht@aecom.yu.edu> wrote about trying to match not
> > quite identical text strings between datasets. I also spend a great deal of
> > time trying to match across administrative databases and have developed a
> > few tools to help. There is a fair amount of literature on string
> > comparators (e.g., US Census web site) that produce some rating of the
> > similarity of two text strings. I have coded up a couple of them and tend
> > to use the bigram (which counts the proportion of 2-character substrings
> > that exist in both strings). I have also automated some of the common-typo
> > problems (e.g., l vs. 1, 0 vs. O) for specific projects where I simply create
> > a new version of each of the strings that replaces all occurrences of l and
> > O with 1 and 0 (and other common errors) before running the string
> > comparison.
> >
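
A rough Python sketch of the two steps described above, typo normalization followed by a bigram comparison, might look like this. Michael's ado file is not shown in this thread, so the details are assumptions: the score here is the Dice coefficient on sets of bigrams (twice the number of shared bigrams divided by the total number of bigrams in both strings), and the names normalize_typos and bigram_similarity are invented for this example.

```python
# Map the look-alike characters mentioned above: l -> 1 and O -> 0.
TYPO_MAP = str.maketrans({"l": "1", "O": "0"})


def normalize_typos(s: str) -> str:
    """Return a copy of `s` with l and O replaced by 1 and 0."""
    return s.translate(TYPO_MAP)


def bigrams(s: str) -> set:
    """Set of all 2-character substrings of `s`."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient on bigram sets: 0.0 (no overlap) to 1.0."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        # Strings shorter than 2 characters have no bigrams.
        return 1.0 if a == b else 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


print(bigram_similarity("night", "nacht"))  # 0.25
print(bigram_similarity(normalize_typos("10O1 Main"),
                        normalize_typos("1001 Main")))  # 1.0
```

Using sets means a repeated bigram counts only once; counting repeats is another reasonable choice and gives slightly different scores.
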
> > If there is interest, I can email the bigram ado file or potentially post it
> > on SSC when I get around to writing up the help.
> >
> > Michael Blasnik
> > michael.blasnik@verizon.net
> >
> >
> > *
> > * For searches and help try:
> > * http://www.stata.com/support/faqs/res/findit.html
> > * http://www.stata.com/support/statalist/faq
> > * http://www.ats.ucla.edu/stat/stata/
> >
>

--
Roger Newson
Lecturer in Medical Statistics
Department of Public Health Sciences
Division of Asthma, Allergy and Lung Biology
King's College London

5th Floor, Capital House
42 Weston Street
London SE1 3QD
United Kingdom

Tel: 020 7848 6648 International +44 20 7848 6648
Fax: 020 7848 6620 International +44 20 7848 6620
  or 020 7848 6605 International +44 20 7848 6605
Email: roger.newson@kcl.ac.uk
Website: http://phs.kcl.ac.uk/rogernewson/

Opinions expressed are those of the author, not the institution.


