
From: Simon Moore <simoncmoore@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: st: String search

Date: Wed, 13 Aug 2008 14:17:24 +0100

Dear Statalist,

I have a string variable that contains values such as:

"outside the red lion pub"

"red lion"

"in the red lyon"

and so on.

I need to search this variable for names (e.g. "red lion") and would like to do so in a way that copes with the inevitable typos (e.g. "red lyon").

Searching through the statalist archives I have come across, for example:

    g pub = 0
    replace pub = 1 if index(lower(var1), "red lion")

But this does not cope well if there's any deviation in spelling. I also came across a rather neat routine written by Laura Giuliano that computes the Levenshtein distance and goes something like this:

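    local word1 = "simon"
    local word2 = "slim"
    local L1 = length("`word1'")
    local L2 = length("`word2'")

    * A[j+1, i+1] will hold the distance between the first j characters
    * of word2 and the first i characters of word1
    matrix A=J(`L2'+1, `L1'+1, 0)

    * first row and first column: distance from the empty string
    forval i = 0 / `L1' {
        matrix A[1,`i'+1] = `i'
    }
    forval j = 1 / `L2' {
        matrix A[`j'+1,1] = `j'
    }

    forval j = 1 / `L2' {
        forval i = 1 / `L1' {
            * substitution is free when the two characters agree
            if substr("`word2'", `j', 1) == substr("`word1'", `i', 1) {
                local cost=0
            }
            else {
                local cost=1
            }
            * take the cheapest of the two one-character gaps or a substitution
            local m = 1 + A[`j', `i'+1]
            local n = 1 + A[`j'+1, `i']
            local d = `cost' + A[`j', `i']
            matrix A[`j'+1,`i'+1]=min(`m',`n',`d')
        }
    }

    * the bottom-right cell is the distance between the two full words
    local lev = A[`L2'+1, `L1'+1]
    di "Levenshtein distance between `word1' and `word2' is `lev' "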

This would be great, except that my string variable has the odd additional, and redundant, word thrown in.

So, would anyone happen to know if there's a routine that combines index and Levenshtein to provide some measure of whether the text is definitely, or nearly definitely, in the string variable? For example, a score of 0 if "red lion" is present, 1 if "red lyon" is present and so on.
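To make the kind of score I have in mind concrete, here is a rough sketch in Mata rather than an existing routine. It uses the same table as the Levenshtein code above, except that the first row is left at zero, so a match may start anywhere in the string, and the answer is the minimum of the last row, so a match may end anywhere. The names fuzzydist and score are made up for illustration, and var1 is assumed to be the string variable.

```stata
mata:
// fuzzydist() is a hypothetical helper, not a built-in Stata or Mata function.
// It returns the smallest Levenshtein distance between the pattern p and
// any substring of the text s (0 means p occurs exactly somewhere in s).
real scalar fuzzydist(string scalar s, string scalar p)
{
    real scalar i, j, m, n, cost
    real matrix D

    m = strlen(p)
    n = strlen(s)
    D = J(m + 1, n + 1, 0)      // row 1 stays 0: a match may start anywhere in s
    for (i = 1; i <= m; i++) {
        D[i + 1, 1] = i         // i pattern characters matched against nothing
    }
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            cost = (substr(p, i, 1) != substr(s, j, 1))
            D[i + 1, j + 1] = min((D[i, j + 1] + 1, D[i + 1, j] + 1, D[i, j] + cost))
        }
    }
    return(min(D[m + 1, .]))    // a match may end anywhere in s
}
end

* score is a hypothetical new variable; var1 is assumed to be the string variable
generate int score = .
forvalues k = 1/`=_N' {
    mata: st_store(`k', "score", fuzzydist(strlower(st_sdata(`k', "var1")), "red lion"))
}
```

With the three example values above, this should give 0 for "outside the red lion pub" and "red lion", and 1 for "in the red lyon". Looping over observations like this would presumably be slow on a large dataset.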

As ever, any guidance greatly appreciated.

Regards

Simon

