Statalist



Re: st: String search


From   "Rafal Raciborski" <rraciborski@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: String search
Date   Wed, 13 Aug 2008 11:13:57 -0400

Try regular-expression matching, for example:

. list

     +--------------------------+
     |                    myvar |
     |--------------------------|
  1. | outside the red lion pub |
  2. |                 red lion |
  3. |          in the red lyon |
  4. |                 Red Lion |
  5. |                 red Lyon |
     |--------------------------|
  6. |                 red loon |
     +--------------------------+

. gen found = regexm(myvar, "[rR]ed [lL][iy]?on")

. list

     +----------------------------------+
     |                    myvar   found |
     |----------------------------------|
  1. | outside the red lion pub       1 |
  2. |                 red lion       1 |
  3. |          in the red lyon       1 |
  4. |                 Red Lion       1 |
  5. |                 red Lyon       1 |
     |----------------------------------|
  6. |                 red loon       0 |
     +----------------------------------+
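
If the list of variants grows beyond what one regular expression can reasonably cover, one possible way to combine the index() idea with Levenshtein distance is approximate substring matching: fill the same dynamic-programming matrix, but set the whole first row to 0 so that a match may start anywhere in the longer string, and take the minimum of the last row so that it may end anywhere. The Mata sketch below is only an illustration under those assumptions, not a routine from this thread or from official Stata; the names fuzzy_find(), fuzzy_var() and dist are hypothetical, and the code assumes plain ASCII text.

```stata
mata:
// Hypothetical helper: smallest edit distance between `pattern' and
// any substring of `text' (0 means the pattern occurs exactly).
real scalar fuzzy_find(string scalar pattern, string scalar text)
{
    real scalar m, n, i, j, cost
    real matrix D

    m = strlen(pattern)
    n = strlen(text)
    // Row 1 stays 0: a match may begin at any position in text
    D = J(m + 1, n + 1, 0)
    // Column 1: cost of deleting the first i pattern characters
    for (i = 1; i <= m; i++) D[i + 1, 1] = i
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            // substitution cost: 0 if the characters match, 1 otherwise
            cost = (substr(pattern, i, 1) != substr(text, j, 1))
            D[i + 1, j + 1] = min((D[i, j + 1] + 1, D[i + 1, j] + 1, D[i, j] + cost))
        }
    }
    // Minimum over the last row: a match may end at any position in text
    return(min(D[m + 1, .]))
}

// Hypothetical wrapper: apply fuzzy_find() to every observation of the
// string variable svar (lowercased first) and store the result in newvar.
void fuzzy_var(string scalar pattern, string scalar svar, string scalar newvar)
{
    string colvector s
    real colvector d
    real scalar k

    s = st_sdata(., svar)
    d = J(rows(s), 1, .)
    for (k = 1; k <= rows(s); k++) d[k] = fuzzy_find(pattern, strlower(s[k]))
    st_store(., st_addvar("int", newvar), d)
}
end

mata: fuzzy_var("red lion", "myvar", "dist")
list myvar dist
```

With the six observations listed above, dist should be 0 for every row that contains "red lion" in any letter case and 1 for the two "lyon" rows. Note that "red loon" also scores 1, since it is a single substitution away from "red lion", so a distance threshold alone cannot separate it from "red lyon".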









On Wed, Aug 13, 2008 at 9:17 AM, Simon Moore <simoncmoore@gmail.com> wrote:
> Dear Statalist,
>
> I have a string variable that contains values something like this:-
>
> "outside the red lion pub"
> "red lion"
> "in the red lyon"
>
> and so on.
>
> I need to search this variable for names (e.g. "red lion") and would like to
> do so in a way that copes with the inevitable typo (e.g. "red lyon").
>
> Searching through the statalist archives I have come across, for example:
>
> g pub = 0
> replace pub = 1 if index(lower(var1), "red lion")
>
> But this does not cope well if there's any deviation in spelling.  I also
> came across a rather neat routine written by Laura Giuliano that computes
> the Levenshtein distance and goes something like this:
>
> local word1 = "simon"
> local word2 = "slim"
> local L1 = length("`word1'")
> local L2 = length("`word2'")
>
> * A[j+1, i+1] holds the distance between the first j characters of word2
> * and the first i characters of word1
> matrix A = J(`L2'+1, `L1'+1, 0)
> forval i = 0 / `L1' {
>         matrix A[1,`i'+1] = `i'
> }
> forval j = 1 / `L2' {
>         matrix A[`j'+1,1] = `j'
> }
> forval j = 1 / `L2' {
>         forval i = 1 / `L1' {
>                 * substitution cost: 0 if the characters match, 1 otherwise
>                 if substr("`word2'", `j', 1) == substr("`word1'", `i', 1) {
>                         local cost = 0
>                 }
>                 else {
>                         local cost = 1
>                 }
>                 local m = 1 + A[`j', `i'+1]
>                 local n = 1 + A[`j'+1, `i']
>                 local d = `cost' + A[`j', `i']
>                 matrix A[`j'+1,`i'+1] = min(`m',`n',`d')
>         }
> }
> local lev = A[`L2'+1, `L1'+1]
> di "Levenshtein distance between `word1' and `word2' is `lev'"
>
>
> This would be great, except that my string variable has the odd additional,
> and redundant, word thrown in.
>
> So, would anyone happen to know of a routine that combines both index and
> Levenshtein to provide some measure of whether the text is definitely, or
> nearly definitely, in the string variable?  For example, a score of 0 if "red
> lion" is present, 1 if "red lyon" is present, and so on.
>
> As ever, any guidance greatly appreciated.
>
> Regards
> Simon
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>


