
From: "Rafal Raciborski" <rraciborski@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: String search
Date: Wed, 13 Aug 2008 11:13:57 -0400

Try regular expression matching, for example:

. list

     +--------------------------+
     | myvar                    |
     |--------------------------|
  1. | outside the red lion pub |
  2. | red lion                 |
  3. | in the red lyon          |
  4. | Red Lion                 |
  5. | red Lyon                 |
     |--------------------------|
  6. | red loon                 |
     +--------------------------+

. gen found = regexm(myvar, "[rR]ed [lL][iy]?on")

. list

     +----------------------------------+
     | myvar                      found |
     |----------------------------------|
  1. | outside the red lion pub       1 |
  2. | red lion                       1 |
  3. | in the red lyon                1 |
  4. | Red Lion                       1 |
  5. | red Lyon                       1 |
     |----------------------------------|
  6. | red loon                       0 |
     +----------------------------------+

On Wed, Aug 13, 2008 at 9:17 AM, Simon Moore <simoncmoore@gmail.com> wrote:
> Dear Statalist,
>
> I have a string variable that contains values something like this:
>
> "outside the red lion pub"
> "red lion"
> "in the red lyon"
>
> and so on.
>
> I need to search this variable for names (e.g. "red lion") and would like to
> do so in such a way that overcomes the inevitable typo (e.g. "red lyon").
>
> Searching through the statalist archives I have come across, for example:
>
> g pub = 0
> replace pub = 1 if index(lower(var1), "red lion")
>
> But this does not cope well if there's any deviation in spelling. I also
> came across a rather neat routine written by Laura Giuliano that computes
> the Levenshtein distance and goes something like this:
>
> local word1 = "simon"
> local word2 = "slim"
> local L1 = length("`word1'")
> local L2 = length("`word2'")
>
> matrix A=J(`L2'+1, `L1'+1, 0)
> forval i = 0 / `L1' {
>     matrix A[1,`i'+1] = `i'
> }
> forval j = 1 / `L2' {
>     matrix A[`j'+1,1] = `j'
> }
> forval j = 1 / `L2' {
>     forval i = 1 / `L1' {
>         if substr("`word2'", `j', 1) == substr("`word1'", `i', 1) {
>             local cost=0
>         }
>         else {
>             local cost=1
>         }
>         local m = 1 + A[`j', `i'+1]
>         local n = 1 + A[`j'+1, `i']
>         local d = `cost' + A[`j', `i']
>         matrix A[`j'+1,`i'+1]=min(`m',`n',`d')
>     }
> }
> local lev = A[`L2'+1, `L1'+1]
> di "Levenshtein distance between `word1' and `word2' is `lev' "
>
> This would be great, except that my string variable has the odd additional,
> and redundant, word thrown in.
>
> So, would anyone happen to know if there's a routine that kind of combines
> both index and Levenshtein to provide some measure of whether the text is
> definitely or nearly definitely in the string variable? For example, a score
> of 0 if "red lion" is present, 1 if "red lyon" is present and so on.
>
> As ever, any guidance greatly appreciated.
>
> Regards
> Simon
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
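
On the question of combining index() with the Levenshtein distance: one possible approach, which is not part of the original reply, is approximate substring matching. The distance matrix is filled as in the routine quoted above, except that the row for the empty pattern stays at zero (so a match may start anywhere in the string) and the score is the smallest entry in the last row (so it may end anywhere). The sketch below assumes Mata is available (Stata 9 or later). fuzzydist() and the variable name score are hypothetical names chosen for illustration, and myvar is re-entered only so that the example runs on its own.

```stata
clear
input str30 myvar
"outside the red lion pub"
"red lion"
"in the red lyon"
"Red Lion"
"red Lyon"
"red loon"
end

gen score = .

mata:
// Hypothetical helper (not a built-in): smallest edit distance between
// pattern p and any substring of text t.
real scalar fuzzydist(string scalar p, string scalar t)
{
    real scalar i, j, m, n, cost
    real matrix D

    m = strlen(p)
    n = strlen(t)
    D = J(m + 1, n + 1, 0)      // row 1 stays 0: a match may start anywhere in t
    for (i = 1; i <= m; i++) D[i + 1, 1] = i
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            // substitution cost: 0 if the characters agree, 1 otherwise
            cost = (substr(p, i, 1) != substr(t, j, 1))
            D[i + 1, j + 1] = min((D[i, j + 1] + 1, D[i + 1, j] + 1, D[i, j] + cost))
        }
    }
    return(min(D[m + 1, .]))    // a match may end anywhere in t
}

// Score every observation against the lower-cased target name
s = st_sdata(., "myvar")
d = J(rows(s), 1, .)
for (i = 1; i <= rows(s); i++) d[i] = fuzzydist("red lion", strlower(s[i]))
st_store(., "score", d)
end

list myvar score
```

With this data, the rows containing "red lion" in any case should score 0 and the "lyon" rows should score 1. Note that "red loon" is also one substitution away from "red lion" and would score 1 as well, so the score alone cannot tell a typo from a different name.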
