



Re: st: Comparing strings


From   Eric Booth <eric.a.booth@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Comparing strings
Date   Mon, 26 Mar 2012 11:36:28 -0500


Take a look at: -findit reclink-

- Eric


__
Eric A. Booth
Public Policy Research Institute 
Texas A&M University
ebooth@ppri.tamu.edu
+979.845.6754

On Mar 26, 2012, at 11:34 AM, jo la frite wrote:

> Dear Nick,
>
> Thanks for your reply, and sorry for being too cryptic with my question.
>
> I am trying to merge 2 datasets in which observations (firms) are identified by
> their names. The names do not match exactly in the 2 datasets, so I am doing a
> "fuzzy match". My idea is to match 2 names if a large enough fraction
> of a name from dataset 1 (say name1) is in a name from dataset 2 (name2). For
> example, "abcde" could be matched with "abcdtyuk" because
> the FIRST 4 letters are in common out of an average length of (5+8)/2 = 6.5. It is
> important that the comparison sticks to the ordering of the letters. So
> "abcde" is not matched with "edcba" or "bacde",
> even though the letters are the same but in a different order. Does that make
> any sense? Thanks again for your help.
>  
> Jo
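
The rule Jo describes (shared leading characters divided by the average of the two name lengths) can be pinned down with a small sketch. This is Python rather than Stata, purely to make the arithmetic concrete; the function names and the 0.6 cutoff are hypothetical and do not come from the thread.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Count the leading characters that a and b share, in order."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def prefix_score(a: str, b: str) -> float:
    """Shared leading characters divided by the average of the two lengths."""
    if not a and not b:
        return 0.0  # assumption: two empty names are treated as no match
    return common_prefix_len(a, b) / ((len(a) + len(b)) / 2)


THRESHOLD = 0.6  # hypothetical cutoff, chosen only for illustration


def is_match(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """True when the ordered-prefix score reaches the cutoff."""
    return prefix_score(a, b) >= threshold
```

With these definitions, "abcde" against "abcdtyuk" scores 4/6.5, while "abcde" against "edcba" or "bacde" scores 0 because the first characters already differ. A real merge would still need a choice of cutoff and a way to break ties between several candidate names.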
> 
> 
> ________________________________
> From: Nick Cox <njcoxstata@gmail.com>
> To: statalist@hsphsun2.harvard.edu 
> Sent: Monday, March 26, 2012 11:30 AM
> Subject: Re: st: Comparing strings
> 
> This is a bit better:
> 
> mata :
> 
> string scalar strscalarsort(string scalar mystring) {
> 
> real scalar len, i
> string colvector work
> len = strlen(mystring)
> work = J(len, 1, "")
> for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
> _sort(work, 1)
> return(invtokens(work', ""))
> }
> 
> end
> 
> I still don't know what the real problem is, so I am just playing. But
> if you wanted to compare strings regardless of order of characters
> something like this would seem needed as a first step.
> 
> On Mon, Mar 26, 2012 at 1:51 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> -indexnot()- is a function, not a command.
>> 
>> It's not clear to me what you want, but you can check for whether the
>> same letters occur in two strings, at the cost of some programming.
>> For example, a Mata function can be written to sort the characters of
>> a string scalar into alphabetical order. Here is one:
>> 
>> mata :
>> 
>> string scalar deorst(string scalar mystring) {
>> 
>> real scalar len, i
>> string vector work
>> len = strlen(mystring)
>> work = J(len, 1, "")
>> for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
>> _sort(work, 1)
>> mystring = ""
>> for(i = 1; i <= len; i++) mystring = mystring + work[i]
>> return(mystring)
>> }
>> 
>> end
>> 
>> . mata : deorst("sorted")
>>  deorst
>> 
>> . mata : deorst("backwards")
>>  aabcdkrsw
>> 
>> On Sun, Mar 25, 2012 at 10:20 PM, jo la frite <jo_la_frite@yahoo.com> wrote:
>>> Thanks, Nick and Eric. As far as I understand, the indexnot command compares strings regardless of the ordering of the characters in the string. For example, "frog" and "ogfr" are viewed as identical by indexnot.
>>> 
>>> 
>>> Is there a way of controlling for the ordering of the characters? For example, comparing "frog" and "fragro" would return 3 (the position of the first character of "frog" that does not match the character at the same position in "fragro").
>> 
>> From: Nick Cox <njcoxstata@gmail.com>
>> 
>>> Stata naturally does have a concept of alphanumeric order for strings;
>>> otherwise it could not -sort- them. Consider
>>> 
>>> . di ("frog" < "toad")
>>> 1
>>> 
>>> . di ("frog" < "foo")
>>> 0
>>> 
>>> The first statement is true and the second false. Otherwise put, with
>>> strings < means "precedes" and > means "follows" in alphanumeric
>>> order.
>>> 
>>> This allows one step further forwards:
>>> 
>>> gen compare = cond(str1 > str2, indexnot(str1, str2), -indexnot(str1, str2))
>>> 
>>> If strings are identical, this yields 0. Jo did not make explicit that
>>> this is what SAS does too, but either way it seems logical to me.
>>> 
>>> Nick
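
The SAS-style result discussed in this thread (sign taken from sort order, magnitude taken from the leftmost position at which the strings differ) can be sketched position by position. This is a rough Python illustration, not Stata or SAS code. The name compare_like_sas is made up, Python's default string ordering stands in for the sort sequence, and treating a string that is a prefix of the other as differing just past the end of the shorter string is an assumption.

```python
def compare_like_sas(s1: str, s2: str) -> int:
    """Signed position of the leftmost difference between s1 and s2.

    Returns 0 for identical strings, a negative value when s1 sorts
    before s2, and a positive value when s1 sorts after s2.
    """
    if s1 == s2:
        return 0
    # Walk both strings in step and stop at the first mismatch (1-based).
    pos = min(len(s1), len(s2)) + 1  # assumption: used when one string is a prefix
    for i, (a, b) in enumerate(zip(s1, s2), start=1):
        if a != b:
            pos = i
            break
    return pos if s1 > s2 else -pos
```

For the examples used above, "frog" against "toad" gives -1 and "frog" against "foo" gives 2. Because the comparison is positional, "frog" against "ogfr" gives -1 rather than 0.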
>>> 
>>> On Sat, Mar 24, 2012 at 10:47 PM, Eric Booth <eric.a.booth@gmail.com> wrote:
>>> 
>>>> Take a look at the string function (-help string_functions-) indexnot() (e.g., "gen x = indexnot(string1, string2)"), which gives the position of the first character of string1 that is not found in string2 (0 if every character of string1 appears in string2).
>>>> This Stata string function does not assign the positive/negative sign like the SAS function you describe, but you can code that yourself by using other string functions to find how the strings differ in order/sequence/length.
>>> 
>>> On Mar 24, 2012, at 5:12 PM, jo la frite wrote:
>>> 
>>>>> Is there a Stata function that corresponds to the SAS function COMPARE? It compares two strings. Specifically, in SAS, COMPARE(string-1, string-2) returns a numeric value. The sign of the result is negative if string-1 precedes string-2 in a sort sequence, and positive if string-1 follows string-2 in a sort sequence. The magnitude of the result is equal to the position of the leftmost character at which the strings differ.
> 
> *
> *   For searches and help try:
> *  http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/   