The Stata listserver
st: Re: Finding "near"-matches

From   "Michael Blasnik" <>
To   <>
Subject   st: Re: Finding "near"-matches
Date   Thu, 27 Oct 2005 11:15:42 -0400

Clyde Schechter wrote about trying to match not-quite-identical text strings between datasets. I also spend a great deal of time matching records across administrative databases and have developed a few tools to help. There is a fair amount of literature on string comparators (e.g., on the US Census web site), which produce a rating of the similarity of two text strings. I have coded up a couple of them and tend to use the bigram comparator, which computes the proportion of two-character substrings that occur in both strings. For specific projects I have also automated the handling of some common typos (e.g., l vs. 1, O vs. 0): I simply create a new version of each string that replaces all occurrences of l and O with 1 and 0 (and fixes other common errors) before running the string comparison.
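
As a rough illustration of the two ideas above (this is not the author's ado file, and it is in Python rather than Stata), here is a minimal sketch. The post does not give the exact bigram formula, so the sketch assumes one common choice: twice the number of shared bigrams divided by the total number of bigrams in both strings. The names normalize, bigrams, bigram_similarity and TYPO_MAP are hypothetical, and the substitution table contains only the two replacements named in the post.

```python
from collections import Counter

# Assumed substitution table: the post names only l -> 1 and O -> 0;
# a real project would extend this with its own common errors.
TYPO_MAP = str.maketrans({"l": "1", "O": "0"})


def normalize(s):
    """Return a copy of s with commonly confused characters replaced."""
    return s.translate(TYPO_MAP)


def bigrams(s):
    """Count every two-character substring of s (with multiplicity)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_similarity(a, b):
    """Similarity in [0, 1]: 2 * shared bigrams / total bigrams.

    This particular ratio is an assumption; the post only says the
    comparator measures the proportion of bigrams found in both strings.
    """
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Neither string has a bigram (length 0 or 1): fall back to equality.
        return 1.0 if a == b else 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2 * shared / total
```

Applying normalize to both strings before calling bigram_similarity means that an ID keyed as A1O0 in one file and A100 in the other compares as identical, whereas the raw strings share only one of their three bigrams. Because the same substitutions are applied to both sides, it does not matter that legitimate letters are also changed (e.g., the l in a surname).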

If there is interest, I can email the bigram ado file or potentially post it on SSC when I get around to writing up the help.

Michael Blasnik
