Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Comparing strings


From   jo la frite <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: Comparing strings
Date   Mon, 26 Mar 2012 13:30:13 -0700 (PDT)

thanks to both of you. I will experiment with strgroup and reclink.
Jo



----- Original Message -----
From: Eric Booth <[email protected]>
To: [email protected]
Cc: 
Sent: Monday, March 26, 2012 7:02 PM
Subject: Re: st: Comparing strings

<>
Also, note that with -reclink- you can use the exclude() and/or exactstr() options to "loop" over your datasets and match on different criteria each time. For example, first find the nearest match where the first letter matches (if you used exactstr(), you'd store that first letter in another variable with the substr() string function), then match where the first two letters match, and so on, letting -reclink- handle the fuzzy match for the rest of the string (though you'd want to calibrate the matching tolerance using the minbigram() or minscore() options).

Another option to consider is working with the package -strgroup- from SSC.  It has similar functionality for matching strings in the same variable within the same dataset, but you could use it for data combination by appending the datasets, running -strgroup- on your merge variable, and then splitting the dataset again so that you can merge on the group/match variable that -strgroup- creates.  (With some strings, especially strings that have many words, I've had better success matching with -strgroup- than with -reclink-, but that may have been my error in specifying the -reclink- matching options.)


- Eric

__
Eric A. Booth
Public Policy Research Institute 
Texas A&M University
[email protected]
+979.845.6754

On Mar 26, 2012, at 11:45 AM, Nick Cox wrote:

> I agree with Eric.
> 
> The problem of fuzzy matches is immensely more difficult than that of
> exact matches, not least because of the difficulty of defining
> (exactly!)  what the problem is. But I doubt that fuzziness usually
> implies that anagrams are allowed and as acceptable as the original,
> so you are best off looking in other directions.
> 
> Nick
> 
> On Mon, Mar 26, 2012 at 5:36 PM, Eric Booth <[email protected]> wrote:
> 
>> Take a look at: -findit reclink-
> 
> On Mar 26, 2012, at 11:34 AM, jo la frite wrote:
> 
>>> Thanks for your reply and sorry for being too cryptic with my question.
>>> 
>>> I am trying to merge 2 datasets, in which observations (firms) are
>>> identified by their names. The names do not match exactly in the 2
>>> datasets so I am doing a "fuzzy match". My idea is to match 2 names if a
>>> large enough fraction of a name from dataset 1 (say name1) is in a name
>>> from dataset 2 (name2). For example, "abcde" could be matched with
>>> "abcdtyuk" because the FIRST 4 letters are in common out of an average
>>> length of (5+8)/2=6.5. It is important that the comparison sticks to the
>>> ordering of the letters. So "abcde" is not matched with "edcba" or
>>> "bacde", even though the letters are the same but in a different order.
>>> Does that make any sense? Thanks again for your help.
> 
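The rule described above (shared leading characters divided by the average length of the two names) could be sketched in Mata as follows. This is only an illustration under assumptions: the function name prefixshare() is made up for this example, and because it counts bytes with strlen() it assumes plain ASCII names.

```stata
mata:
// Hypothetical helper: share of the average length of two strings that is
// covered by their common leading characters (order-sensitive by design)
real scalar prefixshare(string scalar a, string scalar b)
{
    real scalar n, i

    n = min((strlen(a), strlen(b)))
    // advance while the characters in position i agree
    for (i = 1; i <= n; i++) {
        if (substr(a, i, 1) != substr(b, i, 1)) break
    }
    // i - 1 leading characters matched; the result is missing if both are empty
    return((i - 1) / ((strlen(a) + strlen(b)) / 2))
}
end
```

With this sketch, prefixshare("abcde", "abcdtyuk") is 4/6.5, about 0.615, while prefixshare("abcde", "edcba") and prefixshare("abcde", "bacde") are both 0 because the very first characters differ. What counts as a "large enough" fraction would still have to be chosen by hand.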
> From: Nick Cox <[email protected]>
> 
>>> This is a bit better:
>>> 
>>> mata :
>>> 
>>> string scalar strscalarsort(string scalar mystring) {
>>> 
>>> real scalar len, i
>>> string colvector work
>>> len = strlen(mystring)
>>> work = J(len, 1, "")
>>> for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
>>> _sort(work, 1)
>>> return(invtokens(work', ""))
>>> }
>>> 
>>> end
>>> 
>>> I still don't know what the real problem is, so I am just playing. But
>>> if you wanted to compare strings regardless of order of characters
>>> something like this would seem needed as a first step.
> 
> On Mon, Mar 26, 2012 at 1:51 AM, Nick Cox <[email protected]> wrote:
> 
>>>> -indexnot()- is a function, not a command.
>>>> 
>>>> It's not clear to me what you want, but you can check for whether the
>>>> same letters occur in two strings, at the cost of some programming.
>>>> For example, a Mata function can be written to sort the characters of
>>>> a string scalar into alphabetical order. Here is one:
>>>> 
>>>> mata :
>>>> 
>>>> string scalar deorst(string scalar mystring) {
>>>> 
>>>> real scalar len
>>>> string vector work
>>>> len = strlen(mystring)
>>>> work = J(len, 1, "")
>>>> for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
>>>> _sort(work, 1)
>>>> mystring = ""
>>>> for(i = 1; i <= len; i++) mystring = mystring + work[i]
>>>> return(mystring)
>>>> }
>>>> 
>>>> end
>>>> 
>>>> . mata : deorst("sorted")
>>>>  deorst
>>>> 
>>>> . mata : deorst("backwards")
>>>>  aabcdkrsw
>>>> 
>>>> On Sun, Mar 25, 2012 at 10:20 PM, jo la frite <[email protected]> wrote:
>>>>> Thanks Nick and Eric. As far as I understand, the indexnot command compares strings regardless of the ordering of the characters in the string. For example, "frog" and "ogfr" are viewed as identical by indexnot.
>>>>> 
>>>>> Is there a way of controlling for the ordering of the characters? So, for example, comparing "frog" and "fragro" returns 3 (the position of the first character of "frog" that does not match "fragro" at the same position).
>>>> 
>>>> From: Nick Cox <[email protected]>
>>>> 
>>>>> Stata naturally does have a concept of alphanumeric order for strings;
>>>>> otherwise it could not -sort- them. Consider
>>>>> 
>>>>> . di ("frog" < "toad")
>>>>> 1
>>>>> 
>>>>> . di ("frog" < "foo")
>>>>> 0
>>>>> 
>>>>> The first statement is true and the second false. Otherwise put, with
>>>>> strings < means "precedes" and > means "follows" in alphanumeric
>>>>> order.
>>>>> 
>>>>> This allows one step further forwards:
>>>>> 
>>>>> gen compare = cond(string1 > string2, indexnot(string1, string2), -indexnot(string1, string2))
>>>>> 
>>>>> If strings are identical, this yields 0. Jo did not make explicit that
>>>>> this is what SAS does too, but either way it seems logical to me.
>>>>> 
>>>>> Nick
>>>>> 
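A small toy dataset shows what the signed indexnot() construction in the -generate- line above returns. The data are invented for illustration, and the variables are named string1 and string2 here because names of the form str# are reserved words in Stata.

```stata
clear
input str6 string1 str6 string2
"frog" "toad"
"frog" "foo"
"frog" "frog"
"frog" "ogfr"
end

* sign from the sort order of the two strings, magnitude from indexnot()
gen compare = cond(string1 > string2, indexnot(string1, string2), -indexnot(string1, string2))
list
```

The four rows give -1, 2, 0 and 0. The last row is the one to watch: "frog" and "ogfr" are different strings, but indexnot() only checks whether each character of the first string occurs anywhere in the second, so the rearranged letters are reported as if the strings were identical.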
>>>>> On Sat, Mar 24, 2012 at 10:47 PM, Eric Booth <[email protected]> wrote:
>>>>> 
>>>>>> Take a look at the string function (-help string_functions-) indexnot() (e.g., "gen x = indexnot(string1, string2)" )  which will give you the leftmost position where the two strings differ.
>>>>>> This Stata string function does not assign the positive/negative sign like the SAS function you describe, but you can code that yourself by using other string functions to find how the strings differ in order/sequence/length.
>>>>> 
>>>>> On Mar 24, 2012, at 5:12 PM, jo la frite wrote:
>>>>> 
>>>>>>> Is there a Stata function that corresponds to the SAS function "COMPARE"? It allows one to compare strings. Specifically, in SAS, COMPARE(string-1, string-2) returns a numeric value. The sign of the result is negative if string-1 precedes string-2 in a sort sequence, and positive if string-1 follows string-2 in a sort sequence. The magnitude of the result is equal to the position of the leftmost character at which the strings differ.
> 
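The behaviour described in the question (sign from the sort order, magnitude from the leftmost position at which the strings differ) could be sketched in Mata as below. This is a sketch of that description only, not a reimplementation of SAS: the name signedcompare() is made up for this example, a string that is a prefix of a longer one is assumed to differ at the first position past its end, and strlen() counts bytes, so plain ASCII is assumed.

```stata
mata:
// Hypothetical helper: 0 if a and b are identical; otherwise the position of
// the leftmost differing character, negative if a sorts before b
real scalar signedcompare(string scalar a, string scalar b)
{
    real scalar n, i

    if (a == b) return(0)
    n = min((strlen(a), strlen(b)))
    // walk both strings in step until the characters differ
    for (i = 1; i <= n; i++) {
        if (substr(a, i, 1) != substr(b, i, 1)) break
    }
    // if one string is a prefix of the other, i equals n + 1 here
    return(a < b ? -i : i)
}
end
```

For example, signedcompare("frog", "fragro") is 3, signedcompare("frog", "toad") is -1, and signedcompare("frog", "ogfr") is -1 rather than 0, because the comparison is made position by position.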
> *
> *   For searches and help try:
> *  http://www.stata.com/help.cgi?search
> *  http://www.stata.com/support/statalist/faq
> *  http://www.ats.ucla.edu/stat/stata/

