Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: regexs and regexm


From   Robert Picard <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: regexs and regexm
Date   Thu, 3 Oct 2013 09:16:50 -0400

You can use -moss- (from SSC; install it with -ssc install moss-) to split
your string variable using a regex pattern. Here are two ways of splitting
your string:

Robert

* ----------------- begin example ---------------
clear
input str80 s
"UK/FI/EI"
"PMSE    NO(20)"
"PMSE    NO(20),EI(5),GE(35),CN(20)"
"PMSE2004    NO(50),EI(10),GE(30),UK(30),SW(30)"
"POLARLIS    FR(220)"
"LIDAR_GPS    NI(20),NO(20)"
"IASK    SE(60),NO(20),UK(20)"
end

* match each pair of uppercase letters or pair of digits
* (a lone digit, such as the 5 in EI(5), is not matched)
moss s, match("([A-Z][A-Z]|[0-9][0-9])") regex

* match anything that is not a delimiter
moss s, match("([^ \(\),/]+)") regex pre(v_)
* ----------------- end example -----------------
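
If -moss- cannot be installed, the second split can probably also be done
with the built-in functions alone. The loop below is a rough sketch under
stated assumptions, not a tested replacement for -moss-: it reuses the
delimiter class from the second call, and the names rest, tok1, tok2, ...
and the locals i and n are made up for illustration. Each pass copies the
first remaining token with -regexs()-, then deletes it from a working copy
with -regexr()-, and stops once no observation has token characters left.

```stata
* sketch only: assumes the variable s created by -input- above
* rest, tok1, tok2, ... are hypothetical names chosen for this example
gen str80 rest = s
local i = 0
* count observations that still contain a non-delimiter character
count if regexm(rest, "[^ \(\),/]")
local n = r(N)
while `n' > 0 {
    local ++i
    * copy the first remaining token into a new variable
    gen str80 tok`i' = regexs(1) if regexm(rest, "([^ \(\),/]+)")
    * remove that token from the working copy (first match only)
    replace rest = regexr(rest, "[^ \(\),/]+", "")
    count if regexm(rest, "[^ \(\),/]")
    local n = r(N)
}
drop rest
```

Observations with fewer tokens than the longest one simply get empty
strings in the later tok variables.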


On Thu, Oct 3, 2013 at 8:22 AM, Simon Falck <[email protected]> wrote:
> Dear Statalist,
>
> Using Stata 11.2, I want to extract a portion of a string variable using
> regular expressions, i.e. -regexs- and -regexm-
>
> This job is a bit tricky because the string variable contains several
> different types of expressions, lengths, and sometimes spaces, with
> information that looks something like this,
>
> string variable
> UK/FI/EI
> PMSE    NO(20)
> PMSE    NO(20),EI(5),GE(35),CN(20)
> PMSE2004    NO(50),EI(10),GE(30),UK(30),SW(30)
> POLARLIS    FR(220)
> LIDAR_GPS    NI(20),NO(20)
> IASK    SE(60),NO(20),UK(20)
>
> What I want is to extract (decomposed) information from the string variable
> into new columns, such as,
>
> var1  var2  var3  var4  var5  var6  var7  var8  var9  var10  var11  var12
> UK    FI    EI
> PM    SE    NO    20
> PM    SE    NO    20    EI    5     GE    30    UK    30     SE     30
>
> As I understand it, one way of doing this is to use Stata's regular
> expression functions -regexs- and -regexm-, e.g.:
>
> gen x1 = regexs(1)+ regexs(2) if regexm(expnamn, "([a-zA-Z])([a-zA-Z]+)")
> gen x2 = regexs(1)+ regexs(2) if regexm(expnamn, "([0-9]+)*([0-9]+)")
> ..and so on..
>
> However, since the contents of the string variable vary so much in form,
> this task appears far more complex than I first thought, and I am unable
> to construct a proper script that decomposes the string variable in an
> efficient way.
>
> Any suggestions?
>
> Thanks in advance,
> Simon
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/


