Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

AW: st: RE: Merging by string variables


From   "Martin Weiss" <martin.weiss1@gmx.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   AW: st: RE: Merging by string variables
Date   Tue, 1 Dec 2009 16:52:42 +0100

<> 



*************
replace var1 = ltrim(var1)
replace var1 = rtrim(var1)
*************


could be telescoped into - replace var1 = trim(var1)-


HTH
Martin

-----Ursprüngliche Nachricht-----
Von: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] Im Auftrag von Eric Booth
Gesendet: Dienstag, 1. Dezember 2009 16:53
An: statalist@hsphsun2.harvard.edu
Betreff: Re: st: RE: Merging by string variables

>
>

Also, I should add a note of warning that -reclink- may help with some
approximate matching, but you really need to do some clean-up of the string
variables, as Nick suggests.

>From the short snippet of your dataset, it looks like the strings you are
using to merge are similar in the first word(s) and the first word(s) of
each matched pair are distinct from other matched pairs.  So, you might be
able to get -reclink- to save you some work if you first truncate the
strings to the first 10 or so characters, trim spaces from your strings,
and change all your characters to lowercase.  Depending on what the rest of
your dataset looks like, this may not get you too close to 100% matching,
but at least it might help flag the strings that you need clean up by hand:


*****
clear
cap ssc install reclink
**

input str44 var1     var2                                               
"68th precinct youth council inc   "    1                      
"action center for education and community development, inc "  2
"amistad child day care and family center inc"      3          
end

replace var1 = lower(var1)
replace var1 = ltrim(var1)
replace var1 = rtrim(var1)
replace var1 = substr(var1, 1, 10)
sort var1
gen idusing = _n
save "datasetusing.dta", replace


clear
input str44 var1 var3
" 68th precinct youth council, inc."  3
"action center for education and community development-ps 106"  5
" amistad early childhood educational center inc"  5
end

replace var1 = lower(var1)
replace var1 = ltrim(var1)
replace var1 = rtrim(var1)
replace var1 = substr(var1, 1, 10)
sort var1
gen idmaster = _n
save "datasetmaster.dta", replace

//merge with reclink//
reclink var1 using "datasetusing.dta", idmaster(idmaster) idusing(idusing)
gen(_match) minscore(.75)
li var1 Uvar1 _match

*******


Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
Office: +979.845.6754



On Dec 1, 2009, at 9:15 AM, Nick Cox wrote:

> There's no easy and failsafe solution here. 
> 
> -merge- doesn't know about meanings or approximate matches. It's entirely
literal. 
> 
> I can think of two strategies. 
> 
> 1. You just need to work on one or indeed both datasets to produce
variables that will merge. There's detailed advice within 
> 
> SJ-8-3  dm0039  . . .  Stata tip 64: Cleaning up user-entered string
variables
>        . . . . . . . . . . . . . . . . . . . . . . . .  J. Herrin and E.
Poen
>        Q3/08   SJ 8(3):444--445                                 (no
commands)
>        tip on how to clean up user-entered string variables
> 
> 2. You could try soundex or similar tricks. Your example doesn't look
encouraging for that strategy. 
> 
> Nick 
> n.j.cox@durham.ac.uk 
> 
> Meryle Weinstein, Ph.D.
> 
> I have two datasets that Im trying to merge by the following string
> variables:  agencyname sitename siteaddress.  There are slight differences
> in the datasets, particularly in the agencyname and sitename variables so
> I'm having trouble merging the two datasets. The problem seems to be that
> the agencyname differs slightly in each of the datasets.  For example 
> 
> Dataset2                                                     dataset2
> 68th precinct youth council inc                              68th precinct
> youth council, inc.
> action center for education and community development, inc   action center
> for education and community development-ps 106
> amistad child day care and family center inc                 amistad early
> childhood educational center inc
> 
> Any suggestions on a way to merge by these three variables would be
> appreciated.
> 
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/





*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index