Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: identifying last name/first name inversions in administrative datasets


From   Pierre Azoulay <pierre.azoulay@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: identifying last name/first name inversions in administrative datasets
Date   Wed, 12 Oct 2005 08:53:55 -0400

Dear Stata Listers,

I have a data management problem. I am dealing with an administrative
dataset (patent data). I am trying to assign unique identifiers to
individual patenters. The particular problem that I am trying to
tackle is that of first name/last name inversion. Consider the
following records:

id	   patent_id	   fname       lname
686	  6576756	Knuth	    Alexander
686	  6525177	Knuth	    Alexander
686	  6252052	Knuth	    Alexander
686	  6297364	Knuth	    Alexander
802408	6339140	     Alexander	Knuth
802408	6830924	     Alexander	Knuth
802408	6723832	     Alexander	Knuth
802408	6774226	     Alexander	Knuth
802408	6774226	     Alexander	Knuth
802408	6723832	     Alexander	Knuth
802408	6830924	     Alexander	Knuth
802408	6339140	    Alexander	Knuth

Now it's obvious that Knuth Alexander (id=802408) and Alexander Knuth
(id=686) are one and the same person. The question is how to identify
potential inversions in the data (we have 60,000 distinct identifiers
in the data).

My current thought was to write:

egen fl=group(fname lname)
egen x=nvals(id), by(fl)
keep if x>1

This does not work because egen fl=group(fname lname) and egen
fl=group(lname fname) return different sets of identifiers. I would
want to use an equivalent of egengroup() that is indifferent to the
order of the arguments. Then it would work, and I could flag all the
potentially problematic records.

Could someone help me out?

Thanks

Pierre
----------------------------------------------------------------------------------
Pierre Azoulay
Associate Professor of Management
Columbia University                           Phone: (212) 854-9684
Graduate School of Business             Fax:   Don't send any
3022 Broadway, Uris Hall 704
New York, NY 10023                        http://www.columbia.edu/~pa2009/

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index