
Re: st: RE: Matching Names

From   David Bell <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: RE: Matching Names
Date   Fri, 8 Aug 2008 08:56:42 -0400

Dear Max,

I agree that 52000 is a lot of cases. I've never had to deal with that many, but your method depends on your tolerance of bad/missed matches. In my case, where we had to decide if a person named by respondent A is the same as a person with a similar name described by respondent B (where we had recorded gender, race/ethnicity, approximate age), we found that all mechanical matching algorithms were pretty bad.

My advice is to do it in two parts. Match as much as you can (say, using -soundex- as Kieran suggests), then eyeball the matches along with any auxiliary information you have on them (age, gender, or whatever is in your dataset). Then do the same for the non-match side. It's a lot of work, but you only have to do it once. Of course, if you don't care about mismatches, go the mechanical route. I'd still eyeball at least a subset to get a mismatch error rate estimate.
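The "eyeball a subset" step can be sketched roughly as follows. This is an illustration in Python rather than Stata, and the names draw_review_sample and mismatch_rate are hypothetical helpers, not existing commands. It assumes the mechanical matches are already in a list and that a reviewer marks each sampled pair as wrong (True) or right (False).

```python
import random


def draw_review_sample(matched_pairs, sample_size, seed=20080808):
    """Pick a reproducible random subset of matched pairs for manual review.

    matched_pairs: list of whatever represents one match (e.g. a tuple of names).
    The seed is arbitrary; fixing it lets the same sample be drawn again.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(matched_pairs))
    return rng.sample(matched_pairs, k)


def mismatch_rate(review_flags):
    """Estimate the mismatch rate from reviewer judgments.

    review_flags: list of booleans, True when the reviewer judged the
    pair to be a wrong match. Returns (proportion, standard error), using
    the simple binomial standard error sqrt(p * (1 - p) / n).
    """
    n = len(review_flags)
    if n == 0:
        raise ValueError("no reviewed pairs")
    p = sum(review_flags) / n
    se = (p * (1 - p) / n) ** 0.5
    return p, se


if __name__ == "__main__":
    # Made-up pairs, only to show the calling pattern.
    pairs = [("P. BROWN", "PAUL BROWN"), ("P. BROWN", "PETER BROWN"),
             ("A. SMITH", "ANN SMITH"), ("J. DOE", "JOHN DOE")]
    sample = draw_review_sample(pairs, sample_size=3)
    flags = [False, True, False]  # reviewer's verdicts, entered by hand
    print(sample, mismatch_rate(flags))
```

The simple standard error is unreliable when the sample is small or the rate is near zero, so treat the estimate as a rough guide to whether the mechanical route is tolerable.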

David C. Bell
Professor of Sociology
Indiana University Purdue University Indianapolis (IUPUI)
(317) 278-1336

On Aug 7, 2008, at 6:10 PM, Kieran McCaul wrote:

This is a big problem.

You might want to investigate using soundex to help with matching the misspelt names but, depending on the version of soundex that you use, it may not be particularly useful.

Michael Blasnik wrote an egen function to implement a soundex algorithm a while ago for Stata 7.

You could try that.
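For readers who want to see what a soundex code does, here is a minimal sketch of one common variant (American Soundex, with H and W not separating equal codes). It is written in Python purely as an illustration and is not Michael Blasnik's egen function. The accent-stripping step is an added assumption so that accented and unaccented spellings of a name get the same code.

```python
import unicodedata

# Letter-to-digit table for American Soundex; vowels, Y, H and W have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def strip_accents(text):
    """Drop combining marks so that an accented E becomes a plain E."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def soundex(name):
    """Return a 4-character soundex code, or "" if name has no letters A-Z."""
    letters = [ch for ch in strip_accents(name).upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal codes
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (code + "000")[:4]


if __name__ == "__main__":
    for example in ("Robert", "Rupert", "PEREZ", "PERES"):
        print(example, soundex(example))
```

Because soundex keeps the first letter and codes only what follows, spelling variants of one surname tend to agree, but an initial against a full first name (P. versus PAUL) will not, so it is best applied to surnames with first names handled separately.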

Kieran McCaul MPH PhD
WA Centre for Health & Ageing (M573)
University of Western Australia
Level 6, Ainslie House
48 Murray St
Perth 6000
Phone: (08) 9224-2140
Phone: -61-8-9224-2140
email: [email protected]

-----Original Message-----
From: [email protected] [mailto:[email protected] ] On Behalf Of Max Perez Leon
Sent: Friday, 8 August 2008 5:03 AM
To: [email protected]
Subject: st: Matching Names

Hello statalist users,

I am having a big problem trying to merge two datasets with names. The problem is
that there are tons of typos in both datasets. Examples below:

DATASET 1: --------------------- DATASET 2:

NAMES--------------------------- NAMES

LUIS P�REZ --------------------- LUIS P�REZ
P. BROWN ----------------------- PAUL BROWN

I could do it by hand but I have 52568 obs and more to come. I am trying to
establish a method using regular expressions so that I can merge correctly the
Any help will be very much appreciated,

Thanks for your time,
Max Perez Leon

* For searches and help try:

