Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Question: Fuzzy/approximate string matching


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Question: Fuzzy/approximate string matching
Date   Mon, 09 Feb 2004 09:08:42 -0600

Brendan Miller <Brendan_Miller@harvard.edu> asked about how to do a 
"fuzzy merge"

> [...] based on a string field that contains organization names.
> Unfortunately, the names are not listed equivalently in both databases (e.g.
> "The Miller Corporation" in one vs. "Miller Corp." in the other).

What Brendan wants is a "fuzzy/approximate string matching function" that will
do what he is thinking.  I know of no such function and, even if it existed, I
would not recommend he trust it.  Instead, I recommend Brendan do the match 
himself, tailoring the rules to his particular problem.


The basic method
----------------

At each step of the way, we are going to have three datasets:

      resid1.dta             resid2.dta:              matched.dta:
      -----------------      -----------------        ---------------
      Unmatched records      Unmatched records         Matched 
       from 1st dataset       from 2nd dataset         records
      -----------------      -----------------        ----------------

At each step, we are going to look at resid1.dta and resid2.dta, spot 
some matching rule (such as change "Corp" to Corporation" and then match), 
and turn the crank:  matching what we can from resid1 and resid2, adding 
the matches to matched.dta, and updating resid1.dta and resid2.dta to contain
the new leftovers.

One of the advantages of this method is that, once we are down 
to small numbers of unmatched observations, we can apply rules 
highly specific to our problem.


Getting Started
---------------

Our first step will be an exact match.  Let's assume the original datasets 
to be matched are called left.dta and right.dta:

        . use right.dta
        . sort name 
        . save, replace

        . use left.dta
        . sort name
        . merge name using right

        . sort name
        . save result                /* <- just temporarily */

        . keep if _merge==3
        . drop _merge
        . save matched

        . use result
        . keep if _merge==1
        . keep name
        . merge name using left
        . keep if _merge==3
        . drop _merge
        . save resid1 

        . use result
        . keep if _merge==2
        . keep name
        . mege name using right 
        . keep if merge==3
        . drop _merge
        . save resid1 

        . erase result

We are going to ahve to do something like that each iteration, and that is too
much typing for us every to get right time after time.  Moreover, the next
time, we are going to have to add to matched.dta, and that is just going to
complicate the issue.  So let's write a program:

        program merge3 
                confirm file matched.dta
                confirm file resid1.dta 
                confirm file resid2.dta 

                use resid1, clear 
                merge name using resid2 
                if _N==0 {
                        exit
                }
                save result

                keep if _merge==3
                drop _merge
                append using matched
                save matched, replace

                use result, clear 
                keep if _merge==1
                keep name
                merge name using resid1
                keep if _merge==3
                drop _merge
                sort name
                save resid1, replace emptyok

                use result, clear 
                keep if _merge==2
                keep name
                merge name using resid2
                keep if _merge==3
                drop _merge
                sort name
                save resid2, replace emptyok

                erase result.dta
        end

Starting over
-------------

With that, let's start over.  First, we need to do the setup to run our 
program:

        . clear
        . save matched, emptyok

        . use left
        . sort name
        . save resid1

        . use right
        . sort name
        . save resid2

Now let's do the exact match:

        . merge3

Step 2
------

At this point, we have the three datasets.  Look at resid1.dta and resid2.dta.
Spot a problem.  Let's pretend we see things like "Stata Corp" and Stata
Corp.", and "ABC, Inc." and "ABC Inc".  

Let's get rid of periods and commads in the name:

        . use resid1, clear 
        . replace name = subinstr(name, ".", "", .)
        . replace name = subinstr(name, ",", "", .)
        . sort name
        . save, replace

        . use resid2, clear 
        . replace name = subinstr(name, ".", "", .)
        . replace name = subinstr(name, ",", "", .)
        . sort name
        . save, replace

Now we can perform the second step of the merge:

        . merge3 

Basically, we keep working like that until we have merged all the 
observations we think reasonable.


-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index