The Stata listserver

st: RE: regular expressions in Stata


From   "Frank de Libero" <[email protected]>
To   <[email protected]>
Subject   st: RE: regular expressions in Stata
Date   Tue, 13 Sep 2005 11:08:01 -0700

Scott wrote:

Does anyone know how regular expressions are implemented in Stata? 
......

Off the list, Kevin Turner, StataCorp, emailed me the following, which
answers Scott's question:

Getting more to the technical details, the areas where our RE parser is
not POSIX compliant are:

	1) No support for what is called a 'bound', which is the curly brace
	   {#} that denotes a count of items to be matched.
	2) No support for character classes within bracket expressions.
		[:alnum:] 	[:digit:]	[:alpha:]
	   are all examples. This is also very similar to Perl's use of \w \W
	   \s etc. to denote character classes. I don't believe Perl's
	   syntax is POSIX, however. I would have to double-check that.
	3) Any obscure syntax rules that relate to brackets, but as I read the
	   spec, these are usually the result of character classes.

Stata's RE parser (which is derived from Spencer's) has all of the basic
RE syntax items:

	1) Atoms for matching zero or more, one or more, or one or none: * + ?
	2) Subexpressions denoted by parentheses. Btw, subexpression 0 will
	   always return the entire string matched by the RE string.
	3) Branches, which are denoted with pipes: |
	4) Atoms for beginning of line and end of line: ^ $
	5) Atom for matching any character, which is represented as a period.
	6) Support for 'escaping' any reserved character with a backslash.
	   For example, denoting a literal dollar sign could be done with \$
	7) Support for bracket expressions, which are used to list a collection
	   of valid characters to match. [0-9a-z] is an example. [abc] is
	   another.
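
As a rough illustration of the seven items above, here is a minimal sketch
in Python. Python's re module is used only because it accepts the same basic
syntax; it is not Stata's parser, and the sample strings and variable names
are made up for this example.

```python
import re

# 1) * (zero or more), + (one or more), ? (one or none)
star = re.search(r"ab*c", "ac") is not None        # b may be absent
plus = re.search(r"ab+c", "ac") is not None        # b is required here
opt = re.search(r"colou?r", "color") is not None   # u is optional

# 2) Subexpressions in parentheses; subexpression 0 is the whole match
m = re.search(r"([a-z]+)([0-9]+)", "id: abc123!")
whole, letters, digits = m.group(0), m.group(1), m.group(2)

# 3) Branches separated by a pipe
branch = re.search(r"cat|dog", "hot dog").group(0)

# 4) ^ and $ anchor the match to the beginning and end
anchored = re.search(r"^abc$", "abc") is not None
unanchored = re.search(r"^abc$", "xabc") is not None

# 5) A period matches any single character
dot = re.search(r"a.c", "a-c").group(0)

# 6) A backslash escapes a reserved character such as $
dollar = re.search(r"\$[0-9]+", "cost $42").group(0)

# 7) A bracket expression lists the characters that are allowed
bracket = re.search(r"[0-9a-z]+", "ABC x9z DEF").group(0)
```

In Stata itself the corresponding functions are regexm() to test for a
match and regexs(n) to retrieve subexpression n.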

So, to sum it up, the few areas where we are not POSIX compliant are
really in what I would term the 'shortcut syntax' of the POSIX
specification. In other words, you may not have a counting syntax with
curly braces, but you can list out the long form of the RE to match the
number you wish. Also, you might not have a shortcut class for all
alphanumeric characters with [:alnum:], but you can certainly write the
long form, which is [0-9a-zA-Z].
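
The long-form rewriting described above can be sketched as a small helper.
This is only an illustration in Python: the names expand_classes and
expand_bound are hypothetical, not part of Stata or any library. It covers
only the three classes named in this message, and it assumes plain ASCII
text.

```python
import re

# Long forms for the three POSIX classes mentioned in the message (ASCII only).
POSIX_CLASSES = {
    "[:alnum:]": "0-9a-zA-Z",
    "[:digit:]": "0-9",
    "[:alpha:]": "a-zA-Z",
}

def expand_classes(pattern):
    """Replace each [:class:] inside a bracket expression with its long form."""
    for name, chars in POSIX_CLASSES.items():
        pattern = pattern.replace(name, chars)
    return pattern

def expand_bound(atom, low, high=None):
    """Write atom{low} or atom{low,high} without curly braces.

    atom must be a single atom such as "a" or "[0-9]".
    """
    if high is None:
        return atom * low                       # exactly `low` copies
    return atom * low + (atom + "?") * (high - low)  # extra copies are optional

five_digits = expand_bound("[0-9]", 5)          # stands in for [0-9]{5}
two_to_four = expand_bound("a", 2, 4)           # stands in for a{2,4}
alnum_run = expand_classes("[[:alnum:]]+")      # stands in for [[:alnum:]]+

is_five = re.fullmatch(five_digits, "98101") is not None
```

The resulting strings use only the basic syntax listed earlier, so they
should be usable wherever a bound or a character class is not accepted.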


..Frank



