
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



st: regexs and regexm


From   Simon Falck <sfalckstata@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   st: regexs and regexm
Date   Thu, 03 Oct 2013 14:22:23 +0200

Dear Statalist,

Using Stata 11.2, I want to extract portions of a string variable using regular expressions, i.e. with -regexm- and -regexs-.

This job is a bit tricky because the string variable contains several different types of expressions, of varying length and sometimes with spaces. The values look something like this:

string variable
UK/FI/EI
PMSE    NO(20)
PMSE    NO(20),EI(5),GE(35),CN(20)
PMSE2004    NO(50),EI(10),GE(30),UK(30),SW(30)
POLARLIS    FR(220)
LIDAR_GPS    NI(20),NO(20)
IASK    SE(60),NO(20),UK(20)
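
For anyone who wants to experiment, the sample values above can be typed into Stata as follows. This is only a minimal sketch for reproducing the example: the variable name expnamn is the one used in the commands further down, and the width str60 is an assumption that is merely wide enough for these seven rows.

```stata
* Load the sample values into one string variable.
* Quotes keep the embedded spaces, commas and parentheses intact.
clear
input str60 expnamn
"UK/FI/EI"
"PMSE    NO(20)"
"PMSE    NO(20),EI(5),GE(35),CN(20)"
"PMSE2004    NO(50),EI(10),GE(30),UK(30),SW(30)"
"POLARLIS    FR(220)"
"LIDAR_GPS    NI(20),NO(20)"
"IASK    SE(60),NO(20),UK(20)"
end

* Show what was read in
list, clean noobs
```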

What I want is to decompose the information in the string variable into new columns, such as:

var1  var2  var3  var4  var5  var6  var7  var8  var9  var10 var11 var12
UK    FI    EI
PM    SE    NO    20
PM    SE    NO    20    EI    5     GE    30    UK    30    SE    30

As I understand it, one way of doing this is to use Stata's regular expression functions -regexm- and -regexs-, e.g.:

gen x1 = regexs(1) + regexs(2) if regexm(expnamn, "([a-zA-Z])([a-zA-Z]+)")
gen x2 = regexs(1) + regexs(2) if regexm(expnamn, "([0-9]+)*([0-9]+)")
... and so on ...

However, since the values of the string variable vary so much, this task appears far more complex than I first thought, and I have been unable to write a script that decomposes the string variable efficiently.

Any suggestions?

Thanks in advance,
Simon
