The Stata listserver

st: Re: Finding "near"-matches


From   "Michael Blasnik" <[email protected]>
To   <[email protected]>
Subject   st: Re: Finding "near"-matches
Date   Thu, 27 Oct 2005 11:15:42 -0400

"Clyde Schechter" <[email protected]> wrote about trying to match not-quite-identical text strings between datasets. I also spend a great deal of time matching records across administrative databases and have developed a few tools to help. There is a fair amount of literature on string comparators (e.g., on the US Census web site), which produce a rating of the similarity of two text strings. I have coded up a couple of them and tend to use the bigram comparator, which counts the proportion of 2-character substrings that exist in both strings. I have also automated the handling of some common typos (e.g., l vs. 1, 0 vs. O) for specific projects: I simply create a new version of each string that replaces all occurrences of l and O with 1 and 0 (and likewise for other common errors) before running the string comparison.
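
To make the idea concrete, here is a rough sketch in Python rather than Stata. It is not the ado file mentioned in this message. The function names are made up for illustration, the character table covers only the two substitutions named above, and the score is computed as a Dice-style ratio (twice the shared bigram count over the total bigram count), which is one common way to define a bigram comparator and is an assumption here.

```python
from collections import Counter

# Assumed look-alike table: only the two substitutions named in the post.
# A real project would extend this with its own common errors.
CONFUSABLE = str.maketrans({"l": "1", "O": "0"})


def normalize(s):
    """Return a copy of s with look-alike characters mapped to one form."""
    return s.translate(CONFUSABLE)


def bigrams(s):
    """Count every overlapping 2-character substring of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_similarity(a, b):
    """Dice-style bigram score in [0, 1]; 1.0 means identical bigram counts."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Both strings are shorter than 2 characters: fall back to equality.
        return 1.0 if a == b else 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2 * shared / total


# Normalize both sides first, then compare.
score = bigram_similarity(normalize("lO5 Main St"), normalize("105 Main St"))
```

Because the same substitution is applied to both strings, it does not matter that legitimate letters are also rewritten (e.g., "Hill" becomes "Hi11" on both sides). Pairs scoring above some project-specific cutoff can then be reviewed by hand as candidate matches.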

If there is interest, I can email the bigram ado file or potentially post it on SSC when I get around to writing up the help.

Michael Blasnik
[email protected] .

