[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: data management issue (names listed differently)

From   Gawrich Stefan <>
To   "''" <>
Subject   Re: st: data management issue (names listed differently)
Date   Thu, 3 Jul 2008 14:34:02 +0200

Rufus Peabody wrote: 
I am still wondering how I can merge with a variable that has a mixture of
CorrectSpelling and WrongSpelling. Cleaning it up manually is extremely
time-consuming since there are thousands of observations.


In cases where manual cleaning up of names or adress data is not feasible,
probabilistic deduplication (finds perfect and imperfect matches in one
file) or record linkage (finds perfect or imperfect matches in two files) is
the method of choice. The term "Probabilistic Record linkage" covers both.
In contrast there is deterministic record linkage (only perfect matches)
which is implemented in Stata e.g. in the duplicates and (m)merge command.

Probabilistic Linkage uses string comparison methods (e.g. substrings like
bigrams) and frequency analysis to calculate a score as an indicator for the
likelihood of a match. The result is often a bimodal score distribution with
many small numbers (near certain non-matches), some middle numbers
(uncertain matches, need manual check) and some more high numbers
(near-certain matches). 

In Stata there is to my knowlege the reclink ado by Michael Blasnik which
supports Record Linkage, but not unduplication. There is an older ado
"nmatch" by Peter Sasieni (STB-26 July 1995) that covers some imperfect name
writing issues. But outside of Stata there are quite a lot (a few of them
free) Software programs for probabilistic record linkage availiable (Link
King with SAS, febrl, Link Plus and others) which support unduplication.  

As Eva Poen wrote, the first step is always data cleansing. The next step is
fitting the data to get best possible results with the chosen record Linkage
algorithm. For example, creating a state var by extraction of state
information like "(FL)" and "Florida" from the string and standardizing it
could help. This is feasible only if the state information is given regulary
and in not too many different modes. With matching names one often uses a
variable with the phonetic code in order to overcome typos.  But even the
best string comparator wouldn't identify "USC" and "Southern Cal" as one
university. So before Linkage it would be good to look at all abbrevations
(stringlength < x) or (wordlength < x) and standardize the content. When
names differ in order ("Xtown University" or "University of Xtown") it may
be necessary to split the string into word-variables and skip the "of". Some
Record Linkage Software has array-analysis techniques implemented to find
matches in an array of variables. In this cases the manual assignment might
be time-consuming and error-prone, because the strings are somewhere
distributed in the sorted dataset. 

In the given case each variable that helps to identify the university (like
the name of the football team) should also be used as a linkage var (no
matter how many bears, hawks and bulls teams are present in the dataset).

The same algorithm should also be used with other datasets before merging to
the first dataset.

Probabilistic Record Linkage/Deduplication is a science of it's own and can
be time-consuming to set up. But with huge datasets there is simply no other
way to cope with imperfect "dirty" data. 

Best wishes

Stefan Gawrich


*   For searches and help try:

© Copyright 1996–2017 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index