
Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.



Re: st: Comparing strings


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Comparing strings
Date   Mon, 26 Mar 2012 17:45:12 +0100

I agree with Eric.

The problem of fuzzy matches is immensely more difficult than that of
exact matches, not least because of the difficulty of defining
(exactly!)  what the problem is. But I doubt that fuzziness usually
implies that anagrams are allowed and as acceptable as the original,
so you are best off looking in other directions.

Nick

On Mon, Mar 26, 2012 at 5:36 PM, Eric Booth <eric.a.booth@gmail.com> wrote:

> Take a look at: -findit reclink-

On Mar 26, 2012, at 11:34 AM, jo la frite wrote:

>> Thanks for your reply, and sorry for being too cryptic with my question.
>>
>> I am trying to merge 2 datasets in which observations (firms) are identified by
>> their names. The names do not match exactly across the 2 datasets, so I am doing a
>> "fuzzy match". My idea is to match 2 names if a large enough fraction
>> of a name from dataset 1 (say name1) is in a name from dataset 2 (name2). For
>> example, "abcde" could be matched with "abcdtyuk" because
>> the FIRST 4 letters are in common out of an average length of (5+8)/2=6.5. It is
>> important that the comparison respects the ordering of the letters. So
>> "abcde" is not matched with "edcba" or "bacde",
>> even though the letters are the same, because they are in a different order.
>> Does that make any sense? Thanks again for your help.
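
The rule described above (shared leading characters divided by the average length of the two names) could be sketched in Mata roughly as follows. The function name prefixshare() is made up for illustration and is not part of Stata or of -reclink-; any cutoff for calling two names a match is left to the user.

```stata
mata:

// hypothetical helper: number of leading characters that a and b share,
// divided by the average length of the two strings
real scalar prefixshare(string scalar a, string scalar b)
{
    real scalar i, n

    n = min((strlen(a), strlen(b)))
    if (n == 0) return(0)
    // advance while the characters at position i agree
    for (i = 1; i <= n; i++) {
        if (substr(a, i, 1) != substr(b, i, 1)) break
    }
    // on exit, the first i - 1 characters are common to both strings
    return((i - 1) / ((strlen(a) + strlen(b)) / 2))
}

end
```

With the names from the example, prefixshare("abcde", "abcdtyuk") is 4/6.5, while prefixshare("abcde", "edcba") and prefixshare("abcde", "bacde") are both 0 because the first characters already differ.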

From: Nick Cox <njcoxstata@gmail.com>

>> This is a bit better:
>>
>> mata :
>>
>> string scalar strscalarsort(string scalar mystring) {
>>
>> real scalar len, i
>> string colvector work
>> len = strlen(mystring)
>> work = J(len, 1, "")
>> for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
>> _sort(work, 1)
>> return(invtokens(work', ""))
>> }
>>
>> end
>>
>> I still don't know what the real problem is, so I am just playing. But
>> if you wanted to compare strings regardless of order of characters
>> something like this would seem needed as a first step.
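
As a possible second step, a comparison that ignores character order can be built by sorting the characters of both strings and testing the results for equality. This is only a sketch: sortchars() repeats the sorting idea above under a made-up name so that the block stands alone, and sameletters() is likewise hypothetical.

```stata
mata:

// hypothetical helper: the characters of s sorted into alphabetical order
string scalar sortchars(string scalar s)
{
    real scalar i, len
    string colvector work

    len = strlen(s)
    if (len == 0) return("")
    work = J(len, 1, "")
    for (i = 1; i <= len; i++) work[i] = substr(s, i, 1)
    _sort(work, 1)
    return(invtokens(work', ""))
}

// 1 if a and b consist of the same characters in any order, 0 otherwise
real scalar sameletters(string scalar a, string scalar b)
{
    return(sortchars(a) == sortchars(b))
}

end
```

Here sameletters("frog", "ogfr") is 1 and sameletters("frog", "fragro") is 0. Note that this treats anagrams as equal, which is usually not what a fuzzy match on names wants.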

On Mon, Mar 26, 2012 at 1:51 AM, Nick Cox <njcoxstata@gmail.com> wrote:

>>> -indexnot()- is a function, not a command.
>>>
>>> It's not clear to me what you want, but you can check for whether the
>>> same letters occur in two strings, at the cost of some programming.
>>> For example, a Mata function can be written to sort the characters of
>>> a string scalar into alphabetical order. Here is one:
>>>
>>> mata :
>>>
>>> string scalar deorst(string scalar mystring) {
>>>
>>> real scalar len, i
>>> string vector work
>>> len = strlen(mystring)
>>> work = J(len, 1, "")
>>> for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
>>> _sort(work, 1)
>>> mystring = ""
>>> for(i = 1; i <= len; i++) mystring = mystring + work[i]
>>> return(mystring)
>>> }
>>>
>>> end
>>>
>>> . mata : deorst("sorted")
>>>  deorst
>>>
>>> . mata : deorst("backwards")
>>>  aabcdkrsw
>>>
>>> On Sun, Mar 25, 2012 at 10:20 PM, jo la frite <jo_la_frite@yahoo.com> wrote:
>>>> Thanks Nick and Eric. As far as I understand, the indexnot command compares strings regardless of the ordering of the characters in the string. For example, "frog" and "ogfr" are viewed as identical by indexnot.
>>>>
>>>>
>>>> Is there a way of controlling for the ordering of the characters? So, for example, comparing "frog" and "fragro" returns 3 (the position of the first character at which "frog" departs from "fragro").
>>>
>>> From: Nick Cox <njcoxstata@gmail.com>
>>>
>>>> Stata naturally does have a concept of alphanumeric order for strings;
>>>> otherwise it could not -sort- them. Consider
>>>>
>>>> . di ("frog" < "toad")
>>>> 1
>>>>
>>>> . di ("frog" < "foo")
>>>> 0
>>>>
>>>> The first statement is true and the second false. Otherwise put, with
>>>> strings < means "precedes" and > means "follows" in alphanumeric
>>>> order.
>>>>
>>>> This allows one step further forwards:
>>>>
>>>> gen compare = cond(str1 > str2, indexnot(str1, str2), -indexnot(str1, str2))
>>>>
>>>> If strings are identical, this yields 0. Jo did not make explicit that
>>>> this is what SAS does too, but either way it seems logical to me.
>>>>
>>>> Nick
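
For a comparison that works position by position, a signed result of the kind described for SAS COMPARE can be sketched in Mata along these lines. The name sascompare() is made up, and the sketch makes no attempt to reproduce SAS's treatment of trailing blanks or its optional modifiers; it only follows the description given in this thread.

```stata
mata:

// hypothetical helper: 0 if the strings are identical; otherwise the
// position of the leftmost character at which they differ, negative if
// s1 precedes s2 in sort order and positive if s1 follows s2
real scalar sascompare(string scalar s1, string scalar s2)
{
    real scalar i, n

    if (s1 == s2) return(0)
    n = min((strlen(s1), strlen(s2)))
    // look for the leftmost position at which the characters differ
    for (i = 1; i <= n; i++) {
        if (substr(s1, i, 1) != substr(s2, i, 1)) break
    }
    // if one string is a prefix of the other, i is n + 1 at this point
    return((s1 > s2 ? i : -i))
}

end
```

For example, sascompare("frog", "fragro") is 3, sascompare("frog", "toad") is -1, and sascompare("frog", "frog") is 0.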
>>>>
>>>> On Sat, Mar 24, 2012 at 10:47 PM, Eric Booth <eric.a.booth@gmail.com> wrote:
>>>>
>>>>> Take a look at the string function (-help string_functions-) indexnot() (e.g., "gen x = indexnot(string1, string2)"), which will give you the leftmost position where the two strings differ.
>>>>> This Stata string function does not assign the positive/negative sign like the SAS function you describe, but you can code that yourself by using other string functions to find how the strings differ in order/sequence/length.
>>>>
>>>> On Mar 24, 2012, at 5:12 PM, jo la frite wrote:
>>>>
>>>>>> Is there a Stata function that corresponds to the SAS function "COMPARE"? It allows one to compare strings. Specifically, in SAS, COMPARE(string-1, string-2) returns a numeric value. The sign of the result is negative if string-1 precedes string-2 in a sort sequence, and positive if string-1 follows string-2 in a sort sequence. The magnitude of the result is equal to the position of the leftmost character at which the strings differ.
