Statalist



Re: st: RE: Matching Names


From   David Bell <dcbell@iupui.edu>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: RE: Matching Names
Date   Fri, 8 Aug 2008 08:56:42 -0400

Dear Max,

I agree that 52000 is a lot of cases. I've never had to deal with that many, but the right method depends on your tolerance for bad or missed matches. In my case, we had to decide whether a person named by respondent A was the same as a person with a similar name described by respondent B (we had recorded gender, race/ethnicity, and approximate age), and we found that all mechanical matching algorithms were pretty bad.

My advice is to do it in two parts. Match as much as you can (say, using -soundex- as Kieran suggests), then eyeball the matches along with any auxiliary information you have on them (age, gender, or whatever is in your dataset). Then do the same for the non-match side. It's a lot of work, but you only have to do it once. Of course, if you don't care about mismatches, go the mechanical route. I'd still eyeball at least a subset to get an estimate of the mismatch error rate.
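As a rough illustration of the "eyeball a subset" step, here is a minimal Python sketch (not Stata) that draws a reproducible random sample of machine-made matches for manual review and then turns the reviewer's verdicts into an error rate with a simple binomial standard error. The function names `sample_for_review` and `mismatch_rate`, the seed, and the verdicts are all assumptions made up for this example; only the name pairs come from the thread.

```python
import math
import random


def sample_for_review(pairs, n, seed=2008):
    """Draw up to n matched pairs at random for manual checking.

    A fixed seed (an arbitrary choice here) makes the sample reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(pairs, min(n, len(pairs)))


def mismatch_rate(verdicts):
    """Estimate the share of wrong matches from hand-entered verdicts.

    verdicts: list of booleans, True meaning the reviewer judged the
    pair to be a wrong match. Returns (rate, binomial standard error).
    """
    n = len(verdicts)
    if n == 0:
        raise ValueError("no reviewed pairs")
    rate = sum(verdicts) / n
    std_err = math.sqrt(rate * (1 - rate) / n)
    return rate, std_err


# Pairs proposed by some mechanical matcher (names taken from the thread).
candidate_pairs = [
    ("LUIS PÉREZ", "LUIS P´REZ"),
    ("WILLIAM SMITH", "WILLIAM SMITHSS"),
    ("JORGE F. CHOCAN", "JORGE F CHOCANOS"),
    ("P. BROWN", "PAUL BROWN"),
]

to_review = sample_for_review(candidate_pairs, n=3)

# Hypothetical verdicts typed in after looking at each sampled pair.
verdicts = [False, False, True]
rate, std_err = mismatch_rate(verdicts)
print(f"estimated mismatch rate: {rate:.2f} (s.e. {std_err:.2f})")
```

With tens of thousands of records, a review sample of a few hundred pairs would give a much tighter estimate than this toy example; how large a sample is worth the effort depends on how much mismatch you can tolerate.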

Dave
====================================
David C. Bell
Professor of Sociology
Indiana University Purdue University Indianapolis (IUPUI)
(317) 278-1336
====================================




On Aug 7, 2008, at 6:10 PM, Kieran McCaul wrote:


This is a big problem.

You might want to investigate using soundex to help with matching the misspelt names but, depending on the version of soundex that you use, it may not be particularly useful.

Michael Blasnik wrote an egen function to implement a soundex algorithm a while ago for Stata 7.
http://ideas.repec.org/c/boc/bocode/s420901.html

You could try that.
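For readers who want to see what a Soundex key looks like, here is a minimal Python sketch of the common American Soundex variant. It is not Michael Blasnik's egen function, and other Soundex variants give different codes. The function name `soundex` and the choice to drop anything outside A-Z (including accented letters) are assumptions of this sketch.

```python
# Digit assigned to each consonant group in American Soundex.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character Soundex key, or "" if word has no A-Z letters."""
    # Keep plain A-Z only; accented letters and punctuation are dropped.
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            # H and W do not separate letters that share a code.
            continue
        digit = CODES.get(c, "")  # vowels and Y map to ""
        if digit and digit != prev:
            key += digit
        prev = digit
    # Pad with zeros and cut to one letter plus three digits.
    return (key + "000")[:4]


# Surnames from the examples in this thread.
for a, b in [("PÉREZ", "P´REZ"), ("SMITH", "SMITHSS"),
             ("CHOCAN", "CHOCANOS"), ("BROWN", "BROWN")]:
    print(a, soundex(a), "|", b, soundex(b))
```

In this sketch PÉREZ and P´REZ get the same key, but SMITH (S530) and SMITHSS (S532) do not, because the extra trailing letters still fit inside the four-character code. That is one concrete way in which Soundex alone may miss matches.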




______________________________________________
Kieran McCaul MPH PhD
WA Centre for Health & Ageing (M573)
University of Western Australia
Level 6, Ainslie House
48 Murray St
Perth 6000
Phone: (08) 9224-2140
Phone: -61-8-9224-2140
email: kamccaul@meddent.uwa.edu.au
http://myprofile.cos.com/mccaul
_______________________________________________


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu ] On Behalf Of Max Perez Leon
Sent: Friday, 8 August 2008 5:03 AM
To: statalist@hsphsun2.harvard.edu
Subject: st: Matching Names


Hello statalist users,

I am having a big problem trying to merge two datasets with names. The problem is
that there are tons of typos in both datasets. Examples below:

DATASET 1: --------------------- DATASET 2:

NAMES--------------------------- NAMES

LUIS PÉREZ --------------------- LUIS P´REZ
WILLIAM SMITH ------------------ WILLIAM SMITHSS
JORGE F. CHOCAN ---------------- JORGE F CHOCANOS
P. BROWN ----------------------- PAUL BROWN
ENRIQUETA GAUDENCIA------------- ENRIQUETA G

I could do it by hand but I have 52568 obs and more to come. I am trying to
establish a method using regular expressions so that I can merge the datasets
correctly.
Any help will be very much appreciated,
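As a starting point for the regular-expression idea, here is a minimal Python sketch (not Stata) that cleans names before any merge: it strips accents, replaces punctuation with spaces, and collapses repeated spaces. The function name `normalize_name` and the specific cleaning rules are assumptions for illustration, not a method proposed in this thread.

```python
import re
import unicodedata


def normalize_name(name):
    """Return an upper-case, accent-free, punctuation-free version of name."""
    # Split accented letters into base letter + combining mark,
    # then drop the combining marks (so "É" becomes "E").
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    upper = no_marks.upper()
    # Replace anything that is not A-Z or a space with a space.
    letters_only = re.sub(r"[^A-Z ]", " ", upper)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", letters_only).strip()


for raw in ["LUIS PÉREZ", "JORGE F. CHOCAN", "  p.  brown "]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

Cleaning of this kind only removes differences in accents, case, punctuation, and spacing. Names with dropped or added letters (SMITH vs SMITHSS) or abbreviated parts (P. BROWN vs PAUL BROWN) will still differ after normalization and need approximate matching or manual review.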

Thanks for your time,
Max Perez Leon
PUCP-IEP




*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/





