Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: Finding "near"-matches


From   Clyde Schechter <cschecht@aecom.yu.edu>
To   statalist@hsphsun2.harvard.edu
Subject   st: Finding "near"-matches
Date   Thu, 27 Oct 2005 09:59:42 -0400

Friends,

In managing data for projects, I sometimes get requests from the investigators for reports relating to the subjects in a specific list they provide. Typically the list is given as a text or Excel file containing the subjects' unique study identifier (string variable), and the first two letters of the subjects' first and last names (as a redundancy check).

The study identifiers in the list often contain errors, and when I merge them with the file of valid study ID's usually several will come up unmatched. When I then merge on the initials in the file of valid IDs, I almost always can find a "near" match to the ID in the list. Sometimes, though, there will be dozens of IDs with the matching initials and I will have to scan a long list visually to find the near ID. Once I find the near match, I get back to the requester asking them to verify that that is who they actually wanted. I've never been wrong yet, though, rarely, there will be two near-matches that I can't choose between.

But visual scanning for near matches can be slow, and if the list of mismatched IDs is long, and contains people with common initials, it can take a lot of time. I have noticed that most of the incorrect IDs fall into a small number of error categories: confusing letter "O" and number "0", confusing letter "l" and number "1", inverting the order of some digits, omitting one or more leading zeroes, or omitting one or more digits from a repeated sequence (e.g. listing 32257A1 when it should be 322257A1).

I've been thinking of trying to write a program to look for near matches based on this taxonomy of errors. But each of these errors (except dropping leading zero) can occur at any position in the ID string, and typically these strings run 10 to 16 characters. So this leaves a lot of possibilities to check and coding all of that seems ponderous. So I'm wondering if anybody knows of an efficient way to do this, or even an existing program that does this or something close to it. (Getting near matches for names is pretty easy using Soundex, but that won't work for arbitrary alphanumeric ID strings.)

(I know that designing ID numbers with built-in redundancy like an error-correcting code would be a better solution, but typically I don't have control over that: the IDs are usually medical record numbers from various institutions and I have to use them as is.)

Thanks for any suggestions.

Clyde Schechter
Dept. of Family and Social Medicine
Albert Einstein College of Medicine
Bronx, NY, USA

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/




© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index