Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: Eric Booth <eric.a.booth@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: Comparing strings

Date: Mon, 26 Mar 2012 12:02:57 -0500

<>

Also, note that with -reclink- you can use the 'exclude()' and/or 'exactstr()' options to "loop" over your datasets and match on different criteria each time. So, find the nearest match where the first letter matches (if you used 'exactstr()' you'd store that first letter in another variable with the substr() string function), then match if the first two letters match, and so on -- and let -reclink- handle the fuzzy match for the rest of the string (though you'd want to calibrate the matching tolerance using the minbigram() or minscore() options).

Another option to consider is the package -strgroup- from SSC. It has similar functionality for matching strings in the same variable within the same dataset -- but you could use it for data combination by appending the datasets, running -strgroup- on your merge var, and then splitting the dataset again so that you can merge on the group/match variable that -strgroup- creates. (With some strings, especially strings that have many words, I've had better success matching with -strgroup- than -reclink-, but it may have been my error in specifying the -reclink- matching options.)

- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
+979.845.6754

On Mar 26, 2012, at 11:45 AM, Nick Cox wrote:

> I agree with Eric.
>
> The problem of fuzzy matches is immensely more difficult than that of
> exact matches, not least because of the difficulty of defining
> (exactly!) what the problem is. But I doubt that fuzziness usually
> implies that anagrams are allowed and as acceptable as the original,
> so you are best off looking in other directions.
>
> Nick
>
> On Mon, Mar 26, 2012 at 5:36 PM, Eric Booth <eric.a.booth@gmail.com> wrote:
>
>> Take a look at: -findit reclink-
>
> On Mar 26, 2012, at 11:34 AM, jo la frite wrote:
>
>>> Thanks for your reply and sorry for being too cryptic with my question.
>>>
>>> I am trying to merge 2 datasets, in which observations (firms) are
>>> identified by their names. The names do not match exactly in the 2
>>> datasets so I am doing a "fuzzy match". My idea is to match 2 names if
>>> a large enough fraction of a name from dataset 1 (say name1) is in a
>>> name from dataset 2 (name2). For example, "abcde" could be matched with
>>> "abcdtyuk" because the FIRST 4 letters are in common out of an average
>>> length of (5+8)/2=6.5. It is important that the comparison sticks to
>>> the ordering of the letters. So "abcde" is not matched with "edcba" or
>>> "bacde", even though the letters are the same but in a different order.
>>> Does that make any sense? Thanks again for your help.
>
> From: Nick Cox <njcoxstata@gmail.com>
>
>>> This is a bit better:
>>>
>>> mata :
>>>
>>> string scalar strscalarsort(string scalar mystring) {
>>>
>>>        real scalar len, i
>>>        string colvector work
>>>        len = strlen(mystring)
>>>        work = J(len, 1, "")
>>>        for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
>>>        _sort(work, 1)
>>>        return(invtokens(work', ""))
>>> }
>>>
>>> end
>>>
>>> I still don't know what the real problem is, so I am just playing. But
>>> if you wanted to compare strings regardless of order of characters
>>> something like this would seem needed as a first step.
>
> On Mon, Mar 26, 2012 at 1:51 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>
>>>> -indexnot()- is a function, not a command.
>>>>
>>>> It's not clear to me what you want, but you can check for whether the
>>>> same letters occur in two strings, at the cost of some programming.
>>>> For example, a Mata function can be written to sort the characters of
>>>> a string scalar into alphabetical order. Here is one:
>>>>
>>>> mata :
>>>>
>>>> string scalar deorst(string scalar mystring) {
>>>>
>>>>        real scalar len
>>>>        string vector work
>>>>        len = strlen(mystring)
>>>>        work = J(len, 1, "")
>>>>        for(i = 1; i <= len; i++) work[i] = substr(mystring, i, 1)
>>>>        _sort(work, 1)
>>>>        mystring = ""
>>>>        for(i = 1; i <= len; i++) mystring = mystring + work[i]
>>>>        return(mystring)
>>>> }
>>>>
>>>> end
>>>>
>>>> . mata : deorst("sorted")
>>>>  deorst
>>>>
>>>> . mata : deorst("backwards")
>>>>  aabcdkrsw
>>>>
>>>> On Sun, Mar 25, 2012 at 10:20 PM, jo la frite <jo_la_frite@yahoo.com> wrote:
>>>>> Thanks Nick and Eric. As far as I understand, the indexnot command
>>>>> compares strings regardless of the ordering of the characters in the
>>>>> string. For example, "frog" and "ogfr" are viewed as identical by
>>>>> indexnot.
>>>>>
>>>>> Is there a way of controlling for the ordering of the characters? So,
>>>>> for example, comparing "frog" and "fragro" returns 3 (position of the
>>>>> first character from "frog" not in "fragro").
>>>>
>>>> From: Nick Cox <njcoxstata@gmail.com>
>>>>
>>>>> Stata naturally does have a concept of alphanumeric order for strings;
>>>>> otherwise it could not -sort- them. Consider
>>>>>
>>>>> . di ("frog" < "toad")
>>>>> 1
>>>>>
>>>>> . di ("frog" < "foo")
>>>>> 0
>>>>>
>>>>> The first statement is true and the second false. Otherwise put, with
>>>>> strings < means "precedes" and > means "follows" in alphanumeric
>>>>> order.
>>>>>
>>>>> This allows one step further forwards:
>>>>>
>>>>> gen compare = cond(str1 > str2, indexnot(str1, str2), -indexnot(str1, str2))
>>>>>
>>>>> If strings are identical, this yields 0. Jo did not make explicit that
>>>>> this is what SAS does too, but either way it seems logical to me.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Sat, Mar 24, 2012 at 10:47 PM, Eric Booth <eric.a.booth@gmail.com> wrote:
>>>>>
>>>>>> Take a look at the string function (-help string_functions-)
>>>>>> indexnot() (e.g., "gen x = indexnot(string1, string2)"), which will
>>>>>> give you the leftmost position where the two strings differ.
>>>>>> This Stata string function does not assign the positive/negative
>>>>>> sign like the SAS function you describe, but you can code those
>>>>>> yourself by using other string functions to find how they differ in
>>>>>> order/sequence/length.
>>>>>
>>>>> On Mar 24, 2012, at 5:12 PM, jo la frite wrote:
>>>>>
>>>>>>> Is there a Stata function that corresponds to the SAS function
>>>>>>> "COMPARE"? It allows one to compare strings. Specifically, in SAS
>>>>>>> COMPARE(string-1, string-2) returns a numeric value. The sign of
>>>>>>> the result is negative if string-1 precedes string-2 in a sort
>>>>>>> sequence, and positive if string-1 follows string-2 in a sort
>>>>>>> sequence. The magnitude of the result is equal to the position of
>>>>>>> the leftmost character at which the strings differ.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
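A note on the question that started the thread: as jo points out above, indexnot(s1, s2) looks for the first character of s1 that does not occur anywhere in s2, so it is not a position-by-position comparison. What follows is a minimal Mata sketch, not code posted by anyone in the thread, of a comparison in the spirit of the SAS COMPARE description quoted above, plus jo's "share of leading letters in common" idea. The names strcompare() and prefixshare() are hypothetical. Two choices are assumptions rather than statements about SAS: when one string is a prefix of the other, the reported position is one past the shorter length, and two identical strings get a share of 1.

```stata
mata:

// Signed position of the leftmost character at which a and b differ:
// negative if a sorts before b, positive if a sorts after b, 0 if identical.
real scalar strcompare(string scalar a, string scalar b)
{
    real scalar i, n

    if (a == b) return(0)
    n = min((strlen(a), strlen(b)))
    // Stop at the first differing position. If one string is a prefix
    // of the other, the loop runs out and leaves i = n + 1 (assumption).
    for (i = 1; i <= n; i++) {
        if (substr(a, i, 1) != substr(b, i, 1)) break
    }
    return(a > b ? i : -i)
}

// Length of the common leading run divided by the average of the two
// lengths, e.g. 4 / ((5 + 8) / 2) for "abcde" and "abcdtyuk".
real scalar prefixshare(string scalar a, string scalar b)
{
    real scalar common

    if (a == b) return(1)              // assumption: identical strings score 1
    common = abs(strcompare(a, b)) - 1
    return(common / ((strlen(a) + strlen(b)) / 2))
}

end
```

With these definitions, strcompare("frog", "fragro") gives 3 and strcompare("frog", "toad") gives -1, while prefixshare("abcde", "abcdtyuk") gives 4/6.5, roughly 0.615. Because the comparison walks both strings in order, "abcde" and "edcba" share no leading letters and score 0, which is the behaviour jo asked for. For real firm names, -reclink- or -strgroup- as suggested above will usually be the more practical route.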

**Follow-Ups**:

- **Re: st: Comparing strings** *From:* jo la frite <jo_la_frite@yahoo.com>

**References**:

- **st: Comparing strings** *From:* jo la frite <jo_la_frite@yahoo.com>
- **Re: st: Comparing strings** *From:* Eric Booth <eric.a.booth@gmail.com>
- **Re: st: Comparing strings** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Comparing strings** *From:* jo la frite <jo_la_frite@yahoo.com>
- **Re: st: Comparing strings** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Comparing strings** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Comparing strings** *From:* jo la frite <jo_la_frite@yahoo.com>
- **Re: st: Comparing strings** *From:* Eric Booth <eric.a.booth@gmail.com>
- **Re: st: Comparing strings** *From:* Nick Cox <njcoxstata@gmail.com>
