Dear Statalist,

I have a string variable that contains values like these:

"outside the red lion pub"

"red lion"

"in the red lyon"

and so on.

I need to search this variable for names (e.g. "red lion") and would like to do so in a way that copes with the inevitable typos (e.g. "red lyon").

Searching the Statalist archives, I came across, for example:

g pub = 0

replace pub = 1 if index(lower(var1), "red lion")

But this does not cope well if there's any deviation in spelling. I also came across a rather neat routine written by Laura Giuliano that computes the Levenshtein distance and goes something like this:

* The two words to compare
local word1 = "simon"
local word2 = "slim"
local L1 = length("`word1'")
local L2 = length("`word2'")

* A[j+1, i+1] holds the distance between the first j characters
* of word2 and the first i characters of word1
matrix A = J(`L2'+1, `L1'+1, 0)
forval i = 0/`L1' {
    matrix A[1,`i'+1] = `i'
}
forval j = 1/`L2' {
    matrix A[`j'+1,1] = `j'
}

forval j = 1/`L2' {
    forval i = 1/`L1' {
        * Substitution costs 0 if the characters match, 1 otherwise
        if substr("`word2'", `j', 1) == substr("`word1'", `i', 1) {
            local cost = 0
        }
        else {
            local cost = 1
        }
        local m = 1 + A[`j', `i'+1]
        local n = 1 + A[`j'+1, `i']
        local d = `cost' + A[`j', `i']
        matrix A[`j'+1,`i'+1] = min(`m',`n',`d')
    }
}

* The bottom-right cell is the distance between the full words
local lev = A[`L2'+1, `L1'+1]
di "Levenshtein distance between `word1' and `word2' is `lev' "

This would be great, except that my string variable has the odd additional, and redundant, word thrown in.

So, would anyone happen to know of a routine that combines index and Levenshtein to give some measure of whether the text is definitely, or nearly definitely, in the string variable? For example, a score of 0 if "red lion" is present, 1 if "red lyon" is present, and so on.
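
To make the request concrete, here is a minimal sketch of the kind of score I have in mind. It is only an illustration, not an existing routine: it adapts the matrix code above by leaving the first row at zero, so that a match may start anywhere in the longer string, and then takes the smallest value in the last row, so that it may end anywhere. This is the usual approximate substring matching variant of the Levenshtein recurrence. The macro names text, pattern and best are my own, and the example string is hard-coded rather than read from the variable.

```stata
* Sketch only: smallest Levenshtein distance between `pattern'
* and any substring of `text' (names are illustrative)
local text    = lower("in the red lyon")
local pattern = "red lion"
local T = length("`text'")
local P = length("`pattern'")

* Rows index the pattern, columns index the text.
* The first row stays 0: skipping leading text costs nothing.
matrix A = J(`P'+1, `T'+1, 0)
forval j = 1/`P' {
    matrix A[`j'+1, 1] = `j'
}

forval j = 1/`P' {
    forval i = 1/`T' {
        * cost is 1 when the characters differ, 0 when they match
        local cost = substr("`pattern'", `j', 1) != substr("`text'", `i', 1)
        local m = 1 + A[`j', `i'+1]
        local n = 1 + A[`j'+1, `i']
        local d = `cost' + A[`j', `i']
        matrix A[`j'+1, `i'+1] = min(`m', `n', `d')
    }
}

* The match may end anywhere, so take the minimum of the last row
local best = `P'
forval i = 0/`T' {
    local best = min(`best', A[`P'+1, `i'+1])
}
di "Best match of `pattern' within `text' has distance `best'"
```

With these inputs the score is 1, and it would be 0 for "outside the red lion pub". Applying this to a whole variable would mean looping over observations, which I assume would be slow with matrices, hence my hope that something ready-made exists.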

As ever, any guidance greatly appreciated.

Regards

Simon

