# st: String search

From: Simon Moore <[email protected]>
To: [email protected]
Subject: st: String search
Date: Wed, 13 Aug 2008 14:17:24 +0100

Dear Statalist,

I have a string variable that contains values something like this:-

"outside the red lion pub"
"red lion"
"in the red lyon"

and so on.

I need to search this variable for names (e.g. "red lion") and would like to do so in a way that copes with the inevitable typos (e.g. "red lyon").

Searching through the statalist archives I have come across, for example:

g pub = 0
replace pub = 1 if index(lower(var1), "red lion")

But this does not cope well if there's any deviation in spelling. I also came across a rather neat routine written by Laura Giuliano that computes the Levenshtein distance and goes something like this:

local word1 = "simon"
local word2 = "slim"
local L1 = length("`word1'")
local L2 = length("`word2'")

// A[j+1, i+1] holds the distance between the first j characters
// of word2 and the first i characters of word1
matrix A = J(`L2'+1, `L1'+1, 0)
forval i = 0 / `L1' {
    matrix A[1,`i'+1] = `i'
}
forval j = 1 / `L2' {
    matrix A[`j'+1,1] = `j'
}
forval j = 1 / `L2' {
    forval i = 1 / `L1' {
        // a substitution costs nothing when the two characters agree
        if substr("`word2'", `j', 1) == substr("`word1'", `i', 1) {
            local cost = 0
        }
        else {
            local cost = 1
        }
        local m = 1 + A[`j', `i'+1]
        local n = 1 + A[`j'+1, `i']
        local d = `cost' + A[`j', `i']
        matrix A[`j'+1,`i'+1] = min(`m',`n',`d')
    }
}
// the bottom-right cell is the distance between the two whole words
local lev = A[`L2'+1, `L1'+1]
di "Levenshtein distance between `word1' and `word2' is `lev' "

This would be great, except that my string variable has the odd additional, and redundant, word thrown in.

So, would anyone happen to know if there's a routine that combines both index and Levenshtein to provide some measure of whether the text is definitely, or nearly definitely, in the string variable? For example, a score of 0 if "red lion" is present, 1 if "red lyon" is present, and so on.
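
One direction that might give such a score, offered only as a rough sketch: keep the same distance matrix as above, but leave its first row at zero (so a match may start anywhere in the longer string) and take the minimum of its last row (so a match may end anywhere). The Mata function name fuzzyfind(), the variable score, and the three example observations are made up for illustration; var1 is the string variable from the index() example.

```stata
clear
input str40 var1
"outside the red lion pub"
"red lion"
"in the red lyon"
end

mata:
// Smallest edit distance between pat and any substring of txt.
// fuzzyfind() is a hypothetical helper, not an official Stata function.
real scalar fuzzyfind(string scalar txt, string scalar pat)
{
    real scalar n, m, i, j, cost
    real matrix A

    n = strlen(txt)
    m = strlen(pat)
    A = J(m + 1, n + 1, 0)              // first row stays 0: match may start anywhere
    for (j = 1; j <= m; j++) A[j + 1, 1] = j
    for (j = 1; j <= m; j++) {
        for (i = 1; i <= n; i++) {
            cost = (substr(pat, j, 1) != substr(txt, i, 1))
            A[j + 1, i + 1] = min((A[j, i + 1] + 1, A[j + 1, i] + 1, A[j, i] + cost))
        }
    }
    return(min(A[m + 1, .]))            // match may end anywhere
}

// score every observation of var1 against "red lion"
s = st_sdata(., "var1")
d = J(rows(s), 1, .)
for (k = 1; k <= rows(s); k++) d[k] = fuzzyfind(strlower(s[k]), "red lion")
end

gen score = .
mata: st_store(., "score", d)
list var1 score    // score is 0, 0 and 1 for the three example strings
```

A threshold could then be applied in the usual way, e.g. replace pub = 1 if score <= 1, though how many edits to tolerate would presumably depend on the length of the name being searched for.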

As ever, any guidance greatly appreciated.

Regards
Simon
