
Re: st: identifying duplicate records


From   "William Gould, StataCorp LP" <wgould@stata.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: identifying duplicate records
Date   Mon, 13 Feb 2012 10:21:08 -0600

Raoul Reulen <r.c.reulen@gmail.com> wrote, 

> I have a large database with around 300,000 records of individuals.
> There can be more than one record per individual.  Now, how do I
> identify individuals? I assume that it is the same indivual if:
>
>       Date of birth and NHS number are the same  OR
>       date of birth and surname are the same OR
>       surname and NHS number are the same.
>
> [...]
>
> What is the best way to approach this?  I want to end up with an
> id-number that identifies the individual.  

This question has already generated good replies.  I'm late to the 
conversation, but I still want to contribute my thoughts.  

I'm doing that because this question is more generic than many of 
you may realize.  Have you ever had to merge two datasets with 
less-than-perfect identification data?  It's the same problem.

I have two points I wish to make,

    1.  How to solve the problem from first principles (i.e., basic 
        Stata commands), and 

    2.  How better to process data when matches are not certain.

The reason to solve the problem from first principles is that it forces 
us to think the problem through more clearly.  In this case, we will 
hit upon a problem and discover that part 2 is not optional. 

But I get ahead of myself. 


1.  First principles 
--------------------

Assume the variables are 

        dob           date of birth 
        nhsid         NHS number 
        surname       surname 

We will create new variable id.  Raoul offered three rules:

    1.  Date of birth and NHS number are the same  
    2.  date of birth and surname are the same 
    3.  surname and NHS number are the same.

So I'm going to create three new identification variables -- 
id1, id2, and id3 -- based on the three rules. 

The basic solution is, 

        . gen id1 = .                           // rule 1
        . sort dob nhsid
        . by dob nhsid: replace id1 = 1 if _n==1
        . replace id1 = sum(id1)

        . gen id2 = .                           // rule 2
        . sort dob surname 
        . by dob surname: replace id2 = 1 if _n==1
        . replace id2 = sum(id2)

        . gen id3 = .                           // rule 3
        . sort surname nhsid
        . by surname nhsid: replace id3 = 1 if _n==1
        . replace id3 = sum(id3)

I now have three identification variables, and I now realize that they 
may be inconsistent.  Consider the following examples:

         nhsid        dob            surname
         1939488      15jan1953      smith
         1939488      15jan1953      jones
         2234278      15jan1953      smith
         and so on

Call those three examples E1, E2, and E3.  By rule 1, E1 and E2 match.
By rule 2, E1 and E3 match.

We have a problem.
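To see the problem concretely, here is a small sketch you can run (the
three records are typed in with -input-, and dob is kept as a string,
which is good enough for the demonstration):

        . clear
        . input long nhsid str9 dob str5 surname
          1939488 "15jan1953" "smith"
          1939488 "15jan1953" "jones"
          2234278 "15jan1953" "smith"
          end

        . gen id1 = .                           // rule 1
        . sort dob nhsid
        . by dob nhsid: replace id1 = 1 if _n==1
        . replace id1 = sum(id1)

        . gen id2 = .                           // rule 2
        . sort dob surname
        . by dob surname: replace id2 = 1 if _n==1
        . replace id2 = sum(id2)

        . list nhsid dob surname id1 id2

The listing shows E1 and E2 sharing an id1 value while E1 and E3 share
an id2 value, so any single id variable would have to chain all three
records together.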


A better solution 
-----------------

We need to write the matching rules more carefully, and we need to put
them into an order from most to least certain.  In this case, the
rules I suggest are,

       R1.  nhsid, dob, and surname all match
       R2.  only nhsid and dob match
       R3.  only nhsid and surname match
       R4.  only surname and dob match
       R5.  only nhsid matches
       R6.  if none of the above, we will say they are different persons.

The particular rules do not matter -- that's for Raoul to decide -- as 
is the order.  Bear with me, because it will eventually become easy to 
change the rules and their order.

We are going to create two new variables in our data:

        reason   the reason the observations were matched (1, 2, ..., 6)
        id       id number within reason

and from that we will then generate another new variable 

        finalid  overall id number 

Assume the original data is called master.dta.  We proceed like this:

        . use master, clear                 // use original 
                                            // create id var, rule 1
        . gen id = .
        . sort nhsid dob surname
        . by nhsid dob surname: replace id = 1 if _n==1
        . replace id = sum(id)
                                            // split data in two
        . sort id 
        . by id: gen long nrecs = _N
        . preserve
        . keep if nrecs > 1
        . drop nrecs
        . gen reason = 1
        . save reason1                      // ... those who matched 

        . restore
        . keep if nrecs == 1
        . drop nrecs id
        . save therest, replace             // ... and the rest

Having finished with the rule-1 group, we can continue with therest.dta 
and apply rule 2:

        . use therest, clear                // use the rest 
                                            // create id var, rule 2
        . gen id = .
        . sort nhsid dob
        . by nhsid dob: replace id = 1 if _n==1
        . replace id = sum(id)
                                            // split data in two
        . sort id 
        . by id: gen long nrecs = _N
        . preserve
        . keep if nrecs > 1
        . drop nrecs
        . gen reason = 2
        . save reason2                      // ... those who matched 

        . restore
        . keep if nrecs == 1
        . drop nrecs id
        . save therest, replace             // ... and the rest

And so we will continue.  This is too much typing for me, so I'm going 
to make a program of it to save myself typing and reduce the chances of 
error.

        ------------------------------------------------ myfile.do -----
        clear all

        * in the program below, I added option -emptyok- to -save-
        * in case any of the reasons result in 0 matches.

        program doit 
                args reason_no vars

                use therest, clear          // use the rest 
                                            // create id var
                gen id = .
                sort `vars'
                by `vars': replace id = 1 if _n==1
                replace id = sum(id)
                                            // split data in two
                sort id 
                by id: gen long nrecs = _N
                preserve
                keep if nrecs > 1
                drop nrecs
                gen reason = `reason_no'
                                               // ... those who matched
                save reason`reason_no', replace emptyok
                restore
                keep if nrecs == 1
                drop nrecs id
                save therest, replace emptyok   // ... and the rest
        end 

        use master, clear 
        save therest, replace

        doit 1 "nhsid dob surname"
        doit 2 "nhsid dob"
        doit 3 "nhsid surname"
        doit 4 "surname dob"
        doit 5 "nhsid"

        * Now handle the leftovers, who are unmatched:

        use therest, clear
        gen id = _n 
        gen reason = 6 
        save reason6, replace emptyok
        ------------------------------------------------ myfile.do -----

I can now create the first-round results by typing 

        . do myfile

The result is that I will have new datasets reason1.dta, ..., reason6.dta.

If my do-file has an error, I can fix my do-file and run it again. 
I will have updated datasets reason1.dta, ..., reason6.dta. 
	
I would look for poor matches based on data errors in each of the
datasets -- especially reason2.dta through reason5.dta.  Based on that,
I might change the variables my rules are based on, change the order
of the rules, or even repair the original variables nhsid, dob, and
surname in the original master.dta!

     Aside:  I would never actually change master.dta.  I would add 
     lines to my do-file to -replace- values with better values 
     before saving therest.dta for the first time. 

I can change my do-file, type -do myfile-, and I have my new result.

Eventually, I would put the data together thusly, 

        . do myfile 
        . use reason1, clear
        . append using reason2 reason3 reason4 reason5 reason6
        . gen finalid = .
        . sort reason id
        . by reason id: replace finalid = 1 if _n==1
        . replace finalid = sum(finalid)
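Before trusting finalid, a quick sanity check is worthwhile -- just a
defensive sketch -- confirming that every record was assigned a finalid
and that it is constant within each (reason, id) group:

        . assert finalid < .                    // nobody left unassigned
        . sort reason id
        . by reason id: gen byte ok = finalid == finalid[1]
        . assert ok == 1                        // constant within group
        . drop ok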


Closing comments
----------------

Anyone who deals with problematic matches needs to learn this general 
approach.

    1.  The general approach above is applied to an "append-ish" 
        problem, but it works with merge problems as well; only the 
        programming details differ.

    2.  The general approach is

        2.1  Match observations according to the rule you are most 
             certain is correct.  Get those matches out of the way by 
             saving them in a separate dataset.  Your problem is now to 
             deal with the rest. 

        2.2  Match the remaining observations according to the rule you 
             are next most certain about.  Get those observations out of 
             the way by saving them in a separate dataset.  Your problem 
             is now to deal with the rest.

        2.3  Repeat step 2.2 as many times as you feel comfortable 
             repeating it. 

        2.4  Look at the results each step of the way.  I matched on 
             nhsid and dob second because I figured mistyping a date of 
             birth was *MORE* likely than mistyping a surname.  Therefore, 
             matches based on nhsid and dob are more likely to be correct 
             than are matches on nhsid and surname.

        2.5  Write your code so that it is easy to change the order in 
             which the rules are applied because sometimes, after looking 
             at the results, you will find your initial suspicions were 
             wrong.

        2.6  As you look at the data, consider adding more rules, or 
             eliminating rules that produce more noise than you expected.
             In this case, if Raoul has a sex variable, I suggest he add 
             it to the list of variables on which he is matching.  In 
             that case, rule 1 would be to match on nhsid, dob, surname, 
             and sex, and more rules would be added, some including sex, 
             some excluding it.

        2.7  Note that in my code, adding a rule just means adding one  
             line to my do-file.  Reordering the rules just means 
             changing the order of the -doit- lines, and possibly the 
             reason numbers. 

        2.8  Also note that I can run my code from the beginning.  I play 
             interactively, but when I'm producing final datasets, I know 
             exactly how I produced them.  The proof is that I can 
             reproduce them.

        2.9  Note that I know why I matched observations: the reason 
             variable records it.  

I hope this helps.

Don't be surprised if I turn this into a http://blog.stata.com entry.
If I do, I'll edit to write more clearly.  Right now, I wanted to get 
this into Raoul's hands.

-- Bill
wgould@stata.com