Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


RE: st: matching observations for merging


From   "Lachenbruch, Peter" <Peter.Lachenbruch@oregonstate.edu>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   RE: st: matching observations for merging
Date   Thu, 17 Jun 2010 09:02:06 -0700

Almost. In a similar application, I frequently need to sort on physician name, and there may be many physicians. Unfortunately, the names are often entered inconsistently: for one person I may see (to borrow the name of a Statalist contributor, who has never been one of these) WolfeF, WolfF, FWolfe, Fwolfe, etc. And this is before accounting for misspellings and typos. Sorting by name will go far, but with many names and no standard for how names are entered, a lot of work remains. Maarten's idea will be useful to many.

These are often studies from medical records, so there is limited control over spelling, etc.

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Maarten buis
Sent: Thursday, June 17, 2010 8:56 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: matching observations for merging

--- On Thu, 17/6/10, Abhimanyu Arora wrote:
> I have two files to be merged. Is it possible to merge using
> an approximation of the merging variable? In other words, if
> my merging variable is say, country, there could be a slight change in
> spelling of some countries (Afghanistan/ Afganistan) in the two
> files...Is there a more efficient way than just going through all 200+
> countries and checking spelling consistency?

For countries the quickest way is to 
1) keep in each dataset one observation per country
2) merge the 2 datasets
3) keep if _merge != 3 
4) sort on country name
5) list

This will display a list of troublesome country names, which is
usually so short that it doesn't pay to do anything more fancy.
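
A minimal sketch of steps 1-5 as a do-file. The file names file1.dta
and file2.dta and the string variable name country are assumptions for
illustration, and the merge 1:1 syntax assumes Stata 11 or later:

```stata
* Step 1: one observation per country from the first file
* (file1.dta, file2.dta and the variable country are hypothetical names)
use file1, clear
keep country
duplicates drop
tempfile names1
save `names1'

* Step 1 again, for the second file
use file2, clear
keep country
duplicates drop

* Step 2: merge the two lists of country names
merge 1:1 country using `names1'

* Step 3: keep only the names that did not find a match
keep if _merge != 3

* Steps 4 and 5: sort and display the troublesome names
sort country
list country _merge
```

In the listing, _merge == 1 marks a spelling that occurs only in
file2 (the dataset in memory) and _merge == 2 one that occurs only in
file1, so near-identical names end up next to each other after the
sort.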

With this list you can create a recode do-file that harmonizes
country names before the final merge. 
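
Such a do-file might look like the sketch below. The file name
harmonize_countries.do and the variable name country are hypothetical,
the only misspelling shown is the one from the question, and which
spelling to treat as the standard one is an assumption:

```stata
* harmonize_countries.do (hypothetical name)
* Run on the dataset in memory, before the final merge.

* Optional: remove leading and trailing blanks
replace country = trim(country)

* One line per mismatch found in the list of unmatched names
replace country = "Afghanistan" if country == "Afganistan"
```

It would be run with do harmonize_countries after loading each
dataset and before saving or merging it, so both files end up with the
same spellings.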

Moreover, this harmonization do-file can be a good starting point 
in any subsequent project involving a merge on country names, as the
kinds of inconsistencies in country names are pretty similar across 
files. So at the beginning of each project you start by running the 
harmonization do-file of the last project, then go through steps 1-5 
to find any mismatches that weren't handled in the last do-file, and 
add those to your new harmonization file. After 4 or 5 projects you 
will hardly find any mismatches anymore.

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------


      

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP