Boost-based regular expressions

Highlights

New Boost-based regular expression functions
Allow different regular expression syntaxes

Regular expressions are powerful tools for working with string data. In Stata 18, Stata's regular expression functions gain a new engine and additional features.

Overview

Regular expressions are used for

data validation, for example, to check whether a phone number is well formed;
data extraction, for example, to extract phone numbers from a string; and
data transformation, for example, to normalize different phone number inputs.
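
The three uses above can be sketched with the long-standing functions regexm(), regexs(), and regexr(). This is a minimal illustration rather than a recommended recipe: it assumes Stata 18 or later (the {n} quantifier relies on the updated engine), and the phone numbers are made up.

```stata
* A sketch assuming Stata 18 or later; sample numbers are invented.

* Validation: 1 if the whole string has the form ddd-ddd-dddd, 0 otherwise
display regexm("979-555-0100", "^[0-9]{3}-[0-9]{3}-[0-9]{4}$")
display regexm("979-555-01", "^[0-9]{3}-[0-9]{3}-[0-9]{4}$")

* Extraction: regexs(0) returns the text matched by the preceding regexm()
display regexm("Call 979-555-0100 today", "[0-9]{3}-[0-9]{3}-[0-9]{4}")
display regexs(0)

* Transformation: regexr() replaces only the first match, so two nested
* calls are used here to turn both dots into hyphens
display regexr(regexr("979.555.0100", "\.", "-"), "\.", "-")
```

The anchors ^ and $ in the validation pattern force the entire string to match; the extraction pattern omits them so that the number can be found anywhere in the string.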

Stata provides two sets of regular expression functions: the byte-stream-based regexm(), regexr(), and regexs(), and the Unicode-based ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs(). The Unicode-based functions are built on the ICU libraries.

In Stata 18, the byte-stream-based functions are updated to use the Boost library as their engine. The functions are under version control: prefixing a command with version 17: restores the old behavior.

A good discussion of regular expressions in Stata can be found in Asjad Naqvi's Stata guide.

The old implementation is based on Henry Spencer's NFA algorithm and is nearly identical to the POSIX.2 standard. The new implementation in Stata 18 supports a richer syntax. For example, it supports {n} for matching the preceding element exactly n times:

. display regexm("123", "\d{3}")
1

. version 17: display regexm("123", "\d{3}")
0

A set of new functions that use the Boost library exclusively has also been added:

regexmatch() performs a match of a regular expression to an ASCII string.
regexreplace() replaces the first substring that matches a regular expression with specified text.
regexreplaceall() replaces all substrings that match a regular expression with specified text.
regexcapture() returns a subexpression from a previous match.
regexcapturenamed() returns a subexpression corresponding to a matching named group in a regular expression from a previous match.
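
As a rough sketch of how two of these functions might be used, the example below applies regexcapturenamed() and regexreplaceall() to a small invented dataset. It assumes Stata 18 or later; the variable names (contact, cc, rest, digits), the group names, and the pattern are hypothetical and chosen only for illustration. Named groups are written with the Perl-style (?<name>...) syntax.

```stata
* A sketch assuming Stata 18 or later; the data are invented.
clear
input str40 contact
"tel:1-202-456-1414"
"Phone:+81 3-3581-0101"
"no number here"
end

* Two named groups: cc (leading digits, optional +) and rest (remainder)
local re "(?:Phone\:|tel\:)\s*(?<cc>\+?[1-9]{1,3})[\s\-](?<rest>[0-9\s\-]{7,32})"

* regexcapturenamed() reads a named group from the match made by regexmatch()
generate cc   = regexcapturenamed("cc")   if regexmatch(contact, "`re'")
generate rest = regexcapturenamed("rest") if regexmatch(contact, "`re'")

* regexreplaceall() rewrites every match; here each whitespace character
* or hyphen in rest is removed
generate digits = regexreplaceall(rest, "[\s\-]", "")

list contact cc rest digits
```

The third observation has no "Phone:" or "tel:" prefix, so the if qualifier leaves cc and rest empty for it.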

Let's see it work

We would like to match and extract phone numbers in the addresses of heads of governments.

We require the following rules:

The phone number follows “Phone:” or “tel:”.
It may start with “+”.
After “+” or at the start, it has 1 to 3 nonzero digits.
After that, it can have anywhere from 7 to 32 digits, space, or “-”.

If the address matches, we would like to generate a variable, phone, that holds the extracted phone number without the “Phone:” or “tel:” prefix.

We would also like to generate another variable, address1, in which the matched text is replaced by “tel:” followed by the extracted phone number.

. input str120 address

                      address
  1. "1600 Pennsylvania Ave., NW Washington, DC 20500 tel:1-202-456-1414 USA"
  2. "Palais de l'Élysée 55 rue du Faubourg-Saint-Honoré 75008 Paris, Phone:+33 1 42 92 81 00 France"
  3. "10 Downing Street, SW1A 2AA +44-20-7925-0918 United kingdom"
  4. "東京都千代田区永田町2丁目3番1号 100-0014, Phone: +81 3-3581-0101, Japan"
  5. end

. local reg "(?:Phone\:[\s]*?|tel\:[\s]*)([\+]{0, 1}[1-9]{1, 3}[0-9\s\-]{7,32})"

. generate match = regexmatch(address, "`reg'")

. generate address1 = regexreplace(address, "`reg'", "tel:$1")

. generate phone = regexcapture(1) if regexmatch(address, "`reg'")
(1 missing value generated)

. list phone

                   phone 
  1.     1-202-456-1414  
  2.  +33 1 42 92 81 00  
  3.                     
  4.     +81 3-3581-0101

Components of the regular expression in the local macro reg are as follows:

(?:Phone\:[\s]*?|tel\:[\s]*): match either “Phone:” or “tel:”, followed by zero or more whitespace characters, without capturing the match.
([\+]{0, 1}[1-9]{1, 3}[0-9\s\-]{7,32}): match and capture a substring that satisfies the following:
- [\+]{0, 1}: it may start with “+”.
- [1-9]{1, 3}: after “+” or at the start, it has 1 to 3 nonzero digits.
- [0-9\s\-]{7,32}: after that, it has anywhere from 7 to 32 digits, whitespace characters, or “-”.

The third address contains neither “Phone:” nor “tel:”, so it does not match the regular expression, and phone is missing for this observation.
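
One way to check a pattern like this before applying it to a dataset is to try it on single strings with display. The sketch below assumes Stata 18 or later and reuses the regular expression from the example; the test strings are taken from or modeled on the addresses above, and the last one is invented to exercise the length rule.

```stata
* A sketch assuming Stata 18 or later.
local reg "(?:Phone\:[\s]*?|tel\:[\s]*)([\+]{0, 1}[1-9]{1, 3}[0-9\s\-]{7,32})"

* Prefix present: expected to match
display regexmatch("tel:1-202-456-1414", "`reg'")
display regexmatch("Phone: +81 3-3581-0101", "`reg'")

* No "Phone:" or "tel:" prefix: expected not to match
display regexmatch("+44-20-7925-0918", "`reg'")

* Fewer than 7 characters after the leading digit: expected not to match
display regexmatch("tel:1-2345", "`reg'")
```

Checking the failing cases as well as the passing ones helps confirm that each rule in the pattern is actually being enforced.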
