Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: Eric Booth <eric.a.booth@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: Comparing strings

Date: Mon, 26 Mar 2012 11:36:28 -0500

<>

Take a look at: -findit reclink-

- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
+979.845.6754

On Mar 26, 2012, at 11:34 AM, jo la frite wrote:

> Dear Nick,
>
> Thanks for your reply and sorry for being too cryptic with my question.
>
> I am trying to merge 2 datasets, in which observations (firms) are identified by their names. The names do not match exactly in the 2 datasets so I am doing a "fuzzy match". My idea is to match 2 names if a large enough fraction of a name from dataset 1 (say name1) is in a name from dataset 2 (name2). For example, "abcde" could be matched with "abcdtyuk" because the FIRST 4 letters are in common out of an average of (5+8)/2=6.5. It is important that the comparison sticks to the ordering of the letters. So "abcde" is not matched with "edcba" or "bacde", even though the letters are the same but in a different order. Does that make any sense? Thanks again for your help.
>
> Jo
>
> ________________________________
> From: Nick Cox <njcoxstata@gmail.com>
> To: statalist@hsphsun2.harvard.edu
> Sent: Monday, March 26, 2012 11:30 AM
> Subject: Re: st: Comparing strings
>
> This is a bit better:
>
>     mata :
>
>     string scalar strscalarsort(string scalar mystring) {
>
>         real scalar len, i
>         string colvector work
>         len = strlen(mystring)
>         work = J(len, 1, "")
>         for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
>         _sort(work, 1)
>         return(invtokens(work', ""))
>     }
>
>     end
>
> I still don't know what the real problem is, so I am just playing. But if you wanted to compare strings regardless of order of characters something like this would seem needed as a first step.
>
> On Mon, Mar 26, 2012 at 1:51 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> -indexnot()- is a function, not a command.
>>
>> It's not clear to me what you want, but you can check for whether the same letters occur in two strings, at the cost of some programming. For example, a Mata function can be written to sort the characters of a string scalar into alphabetical order. Here is one:
>>
>>     mata :
>>
>>     string scalar deorst(string scalar mystring) {
>>
>>         real scalar len
>>         string vector work
>>         len = strlen(mystring)
>>         work = J(len, 1, "")
>>         for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
>>         _sort(work, 1)
>>         mystring = ""
>>         for(i = 1; i <= len; i++) mystring = mystring + work[i]
>>         return(mystring)
>>     }
>>
>>     end
>>
>>     . mata : deorst("sorted")
>>       deorst
>>
>>     . mata : deorst("backwards")
>>       aabcdkrsw
>>
>> On Sun, Mar 25, 2012 at 10:20 PM, jo la frite <jo_la_frite@yahoo.com> wrote:
>>> Thanks Nick and Eric. As far as I understand, the indexnot command compares strings regardless of the ordering of the characters in the string. For example, "frog" and "ogfr" are viewed as identical by indexnot.
>>>
>>> Is there a way of controlling for the ordering of the characters? So for example, comparing "frog" and "fragro" returns 3 (position of the first character from "frog" not in "fragro").
>>
>> From: Nick Cox <njcoxstata@gmail.com>
>>
>>> Stata naturally does have a concept of alphanumeric order for strings; otherwise it could not -sort- them. Consider
>>>
>>>     . di ("frog" < "toad")
>>>     1
>>>
>>>     . di ("frog" < "foo")
>>>     0
>>>
>>> The first statement is true and the second false. Otherwise put, with strings < means "precedes" and > means "follows" in alphanumeric order.
>>>
>>> This allows one step further forwards:
>>>
>>>     gen compare = cond(str1 > str2, indexnot(str1, str2), -indexnot(str1, str2))
>>>
>>> If strings are identical, this yields 0. Jo did not make explicit that this is what SAS does too, but either way it seems logical to me.
>>>
>>> Nick
>>>
>>> On Sat, Mar 24, 2012 at 10:47 PM, Eric Booth <eric.a.booth@gmail.com> wrote:
>>>
>>>> Take a look at the string function (-help string_functions-) indexnot() (e.g., "gen x = indexnot(string1, string2)") which will give you the leftmost position where the two strings differ.
>>>> This Stata string function does not assign the positive/negative sign like the SAS function you describe, but you can code those yourself by using other string functions to find how they differ in order/sequence/length.
>>>
>>> On Mar 24, 2012, at 5:12 PM, jo la frite wrote:
>>>
>>>>> Is there a Stata function that corresponds to the SAS function "COMPARE"? It allows one to compare strings. Specifically, in SAS, COMPARE(string-1, string-2) returns a numeric value. The sign of the result is negative if string-1 precedes string-2 in a sort sequence, and positive if string-1 follows string-2 in a sort sequence. The magnitude of the result is equal to the position of the leftmost character at which the strings differ.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
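The order-sensitive comparison Jo asks for (sign taken from sort order, magnitude from the leftmost position at which the two strings differ) could be sketched in Mata along the following lines. This is only an illustration and is not code from the thread: the function names `sascompare()` and `prefixshare()` are hypothetical, plain ASCII strings are assumed, and reporting the difference at the position just past the shorter string when one string is a prefix of the other is an assumption about the desired behaviour.

```stata
mata:

// Hypothetical helper, not from the thread: signed position of the
// leftmost character at which a and b differ.
// Negative if a sorts before b, positive if after, 0 if identical.
real scalar sascompare(string scalar a, string scalar b)
{
    real scalar i, n, pos

    if (a == b) return(0)

    n = min((strlen(a), strlen(b)))
    pos = n + 1        // assumption: a prefix differs just past its end
    for (i = 1; i <= n; i++) {
        if (substr(a, i, 1) != substr(b, i, 1)) {
            pos = i
            break
        }
    }
    return(a < b ? -pos : pos)
}

// Hypothetical helper: length of the common leading run of characters,
// divided by the average length of the two strings.
// Returns missing if both strings are empty (0/0).
real scalar prefixshare(string scalar a, string scalar b)
{
    real scalar d, common

    d = abs(sascompare(a, b))
    common = (d == 0 ? strlen(a) : d - 1)
    return(common / ((strlen(a) + strlen(b)) / 2))
}

end
```

Under these assumptions, `mata: sascompare("frog", "fragro")` should give 3, since the strings first differ at the third character and "frog" sorts after "fragro". `mata: prefixshare("abcde", "abcdtyuk")` should give 4/6.5, about 0.62, while "abcde" against "edcba" gives 0 because the first characters already differ. A cutoff on that share for deciding a match would still have to be chosen by the user.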

**Follow-Ups**:
- **Re: st: Comparing strings** *From:* Nick Cox <njcoxstata@gmail.com>

**References**:
- **st: Comparing strings** *From:* jo la frite <jo_la_frite@yahoo.com>
- **Re: st: Comparing strings** *From:* Eric Booth <eric.a.booth@gmail.com>
- **Re: st: Comparing strings** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Comparing strings** *From:* jo la frite <jo_la_frite@yahoo.com>
- **Re: st: Comparing strings** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Comparing strings** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Comparing strings** *From:* jo la frite <jo_la_frite@yahoo.com>
