Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: regexs and regexm


From   Simon Falck <[email protected]>
To   [email protected]
Subject   Re: st: regexs and regexm
Date   Thu, 03 Oct 2013 15:42:07 +0200

Robert,

Thank you for this excellent suggestion. I tried -moss- and it does the job.

All the best,
Simon



On 2013-10-03 15:16, Robert Picard wrote:
You can use -moss- (from SSC) to split your string variable using a
regex pattern. Here are two ways of splitting your string:

Robert

* ----------------- begin example ---------------
clear
input str80 s
"UK/FI/EI"
"PMSE    NO(20)"
"PMSE    NO(20),EI(5),GE(35),CN(20)"
"PMSE2004    NO(50),EI(10),GE(30),UK(30),SW(30)"
"POLARLIS    FR(220)"
"LIDAR_GPS    NI(20),NO(20)"
"IASK    SE(60),NO(20),UK(20)"
end

* match any pair of uppercase letters or pair of digits
moss s, match("([A-Z][A-Z]|[0-9][0-9])") regex

* match anything that is not a delimiter
moss s, match("([^ \(\),/]+)") regex pre(v_)
* ----------------- end example -----------------
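
A possible variant of the first pattern, sketched here as an assumption rather than as part of the original reply: [0-9][0-9] matches digits only in pairs, so a one-digit value such as the 5 in EI(5) is skipped and 220 is cut to 22. Swapping in [0-9]+ keeps each number whole while still splitting letters in pairs. The prefix w_ and the two-row dataset are arbitrary choices for illustration.

```stata
* Sketch only: requires -moss- from SSC (ssc install moss).
clear
input str80 s
"PMSE    NO(20),EI(5),GE(35),CN(20)"
"POLARLIS    FR(220)"
end

* pairs of uppercase letters, or a whole run of digits
moss s, match("([A-Z][A-Z]|[0-9]+)") regex prefix(w_)
list, noobs
```

Letters are still consumed two at a time, so a name with an odd number of letters loses its last character.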


On Thu, Oct 3, 2013 at 8:22 AM, Simon Falck <[email protected]> wrote:
Dear Statalist,

Using Stata 11.2, I want to extract a portion of a string variable using
regular expressions, i.e. -regexs- and -regexm-

This job is a bit tricky because the string variable contains several
different types of expressions, lengths, and sometimes spaces, with
information that looks something like this,

string variable
UK/FI/EI
PMSE    NO(20)
PMSE    NO(20),EI(5),GE(35),CN(20)
PMSE2004    NO(50),EI(10),GE(30),UK(30),SW(30)
POLARLIS    FR(220)
LIDAR_GPS    NI(20),NO(20)
IASK    SE(60),NO(20),UK(20)

What I want is to extract (decomposed) information from the string variable
into new columns, such as,

var1  var2  var3  var4  var5  var6  var7  var8  var9  var10  var11  var12
UK    FI    EI
PM    SE    NO    20
PM    SE    NO    20    EI    5     GE    30    UK    30     SE     30

As I understand, one way of doing this is to use Stata's regular
expressions: -regexs- and -regexm-, i.e.:

gen x1 = regexs(1)+ regexs(2) if regexm(expnamn, "([a-zA-Z])([a-zA-Z]+)")
gen x2 = regexs(1)+ regexs(2) if regexm(expnamn, "([0-9]+)*([0-9]+)")
..and so on..

However, because the contents of the string variable vary so much, this
task appears far more complex than I first thought, and I am unable to
construct a script that decomposes the string variable efficiently.

Any suggestions?

Thanks in advance,
Simon
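
For readers who want to stay with -regexm- and -regexs- alone, one way to handle a varying number of pieces is to peel off one token per pass in a loop. The following is a minimal sketch under assumptions, not something posted in the thread: the variable names rest and tok# are made up, and the delimiter pattern is borrowed from the second -moss- call in the reply.

```stata
* Sketch only: rest and tok# are hypothetical names.
clear
input str80 s
"UK/FI/EI"
"PMSE    NO(20)"
"IASK    SE(60),NO(20),UK(20)"
end

gen str80 rest = s
local i = 0
count if regexm(rest, "([^ \(\),/]+)")
local left = r(N)
while `left' > 0 {
    local ++i
    * first run of non-delimiter characters becomes the next token
    gen str80 tok`i' = regexs(1) if regexm(rest, "([^ \(\),/]+)")
    * remove that run so the next pass finds the following token
    replace rest = regexr(rest, "[^ \(\),/]+", "")
    count if regexm(rest, "([^ \(\),/]+)")
    local left = r(N)
}
drop rest
list, noobs
```

The loop stops once no observation has any non-delimiter characters left, so the number of tok# variables equals the largest token count in the data. Like the second -moss- call, this keeps PMSE as one token rather than splitting it into PM and SE.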

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
*

