
From: "Frank de Libero" <fedmerchant@comcast.net>

To: <statalist@hsphsun2.harvard.edu>

Subject: st: RE: regular expressions in Stata

Date: Tue, 13 Sep 2005 11:08:01 -0700

Scott wrote:

Does anyone know how regular expressions are implemented in Stata? ......

Off the list, Kevin Turner, StataCorp, emailed me the following, which answers Scott's question:

Getting more to the technical details, the areas where our RE parser is not POSIX compliant are:

1) No support for what is called a 'bound', which is the curly brace {#} that denotes a count of items to be matched.

2) No support for character classes within bracket expressions. [:alnum:] [:digit:] [:alpha:] are all examples. This is also very similar to Perl's use of \w \W \s etc. to denote character classes. I don't believe Perl's syntax is POSIX, however. I would have to double-check that.

3) Any obscure syntax rules that relate to brackets, but as I read the spec, these are usually the result of character classes.

Stata's RE parser (which is derived from Spencer's) has all of the basic RE syntax items:

1) Atoms for matching zero or more, one or more, or one or none: * + ?

2) Subexpressions denoted by parentheses. Btw, subexpression 0 will always return the entire string matched by the RE string.

3) Branches, which are denoted with pipes: |

4) Atoms for beginning of line and end of line: ^ $

5) Atom for matching any character, which is represented as a period.

6) Support for 'escaping' any reserved character with a backslash. For example, denoting a literal dollar sign could be done with \$

7) Support for bracket expressions, which are used to list a collection of valid characters to match. [0-9a-z] is an example. [abc] is another.

So, to sum it up, the few areas where we are not POSIX compliant are really in what I would term the 'shortcut syntax' of the POSIX specification. In other words, you may not have a counting syntax with curly braces, but you can write out the long form of the RE to match the number you wish. Also, you might not have a shortcut class for all alphanumeric characters with [:alnum:], but you can certainly write the long form, which is [0-9a-zA-Z].
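The closing point, that each missing shortcut can be written out in long form, can be sketched as a small pattern rewriter. This is only an illustration under stated assumptions: it is written in Python rather than Stata, the helper names `expand_classes` and `repeat_atom` are hypothetical, and Python's `re` module is used solely to check that the long forms behave as expected; it is not Stata's parser. The class table covers ASCII only, following the [0-9a-zA-Z] example in the message.

```python
import re

# Long-form equivalents for a few POSIX character classes (ASCII only).
# The [:alnum:] entry follows the long form given in the message;
# the other two are the analogous ASCII ranges.
POSIX_CLASSES = {
    "[:alnum:]": "0-9a-zA-Z",
    "[:alpha:]": "a-zA-Z",
    "[:digit:]": "0-9",
}


def expand_classes(pattern: str) -> str:
    """Replace POSIX class names inside bracket expressions with ranges."""
    for name, long_form in POSIX_CLASSES.items():
        pattern = pattern.replace(name, long_form)
    return pattern


def repeat_atom(atom: str, count: int) -> str:
    """Write the bound atom{count} in long form by repeating the atom."""
    return atom * count


# "[[:digit:]]{5}" written without a bound or a character class.
digit = expand_classes("[[:digit:]]")
five_digits = "^" + repeat_atom(digit, 5) + "$"

# Sanity check with Python's engine (an assumption: the long form uses
# only syntax the message lists as supported, so it should carry over).
match = re.search(five_digits, "98101")
whole_match = match.group(0) if match else None  # subexpression 0
```

The resulting long-form string uses only bracket expressions and the ^ and $ anchors, so it is the kind of pattern that could be passed to Stata's regular expression functions such as regexm().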
..Frank

**References**: **st: regular expressions in Stata** *From:* scott hankins <scott.hankins@gmail.com>


© Copyright 1996–2016 StataCorp LP