
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at


Re: st: regexm

From   Nick Cox <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: regexm
Date   Sat, 27 Aug 2011 15:33:44 +0100

Strings longer than 244 characters cannot be read into variables. You could read them into Mata.

As said, do look at -moss-.
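A minimal sketch of the Mata route, assuming the long strings sit one per line in a plain text file. The temporary file and the 300-character test line are made up for illustration and are not from the thread.

```stata
mata:
// hypothetical test file: write one 300-character line to a temporary file
fn = st_tempfilename()
fh = fopen(fn, "w")
fput(fh, 300 * "x")
fclose(fh)

// read the line back into a Mata string scalar, which is not
// subject to the 244-character limit on string variables
fh = fopen(fn, "r")
line = fget(fh)
fclose(fh)
strlen(line)      // length of the line as read

unlink(fn)        // remove the temporary file
end
```

With real data, the fopen() call would point at your own exported text file instead of the temporary one.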


On 27 Aug 2011, at 15:22, KOTa <[email protected]> wrote:

Simpler in a logistical way, i.e. I tried to do the whole thing without
creating the additional variables (that split creates) in the middle.

Another question, if you know, also about strings. When I import a file
into Stata (from Excel, for example) I have some very long strings, which
Stata cuts to 244 characters.

Is there any trick to get around it, other than making them shorter
before importing? :)

thank you

2011/8/27 Nick Cox <[email protected]>:
Better in what sense? Quicker to get a solution? Simpler? Other criteria?

I don't know a way of counting more than 9 matches directly. I think
you would need, if you continue to follow that path, to loop over a
string repeatedly finding new instances and counting.

See also -moss- from SSC.
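A rough sketch of the looping idea, to be run from a do-file. It assumes the thing being counted is a percentage such as 0.35% and reuses two lines of Eric's example data from further down the thread; the variable names work and nmatch and the local pat are made up for illustration.

```stata
clear
input str70 j
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
"A: 1.0%-$109(M) 0.1% th_aft."
end

* assumed pattern for one percentage, such as 0.35% or 0%
local pat "[0-9]+\.?[0-9]*%"

gen work = j          // scratch copy that is consumed match by match
gen nmatch = 0        // running count of matches in each observation

count if regexm(work, "`pat'")
local left = r(N)
while `left' > 0 {
    replace nmatch = nmatch + 1 if regexm(work, "`pat'")
    * blank out the first remaining match so the next pass finds a new one
    replace work = regexr(work, "`pat'", " ")
    count if regexm(work, "`pat'")
    local left = r(N)
}
drop work
list j nmatch
```

The loop stops once no observation has a match left, so it is not limited to 9 matches.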


On Sat, Aug 27, 2011 at 2:52 PM, KOTa <[email protected]> wrote:
Yes, I do work with split now; I just thought it would be better with regex.

Anyway, is there a way to find out how many expressions regexm finds?
1. What I mean is I can access the 1st, 2nd, etc. up to the 9th with regexs,
but if I don't know how many there are -> I don't know which one is
2. What if more than 9 expressions are found? According to the manual, regexs
can only take parameters 0-9.


2011/8/27 Nick Cox <[email protected]>:
Well, you did say it always ends by "% th_aft".

I will continue as I started.

If you first blank out stuff you don't need then you can just use
-split- to separate out elements. If you parse on spaces then it is
immaterial when you have 2 or 3 digits before, you retrieve the number
either way.

No need for regex demonstrated.
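A short sketch of that approach, assuming strings shaped like Eric's example data (a leading "A: ", amounts tagged "(M)", and a closing " th_aft."). Which substrings to blank out is a guess from those examples and would need checking against the real data.

```stata
clear
input str70 j
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
"A: 1.0%-$109(M) 0.1% th_aft."
end

* blank out everything that is not a percentage or an amount
replace j = subinstr(j, "A: ", "", 1)
replace j = subinstr(j, " th_aft.", "", .)
replace j = subinstr(j, "(M)", "", .)
replace j = subinstr(j, "-", " ", .)

* parse on spaces (the default): creates j1, j2, ...
split j
list j*
```

Because parsing is on spaces, the number of digits in each percentage does not matter.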


On Sat, Aug 27, 2011 at 2:16 PM, KOTa <[email protected]> wrote:
Thanks Eric and Nick, I used your advice and have almost finished,

but I ran into one small problem on the way.

I have the same type of string - "0.15%-$1(B) 0.14%-$2(B) 0.12%- $2(B) 0.10% th_aft." - where the number of digits after the dot can be 2 or 3; it's
not constant.

I am trying to extract the last % (i.e. 0.10% in this case) using
"$", like this:

g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")

or

g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")

and it fails in both cases: the result is empty.

It does extract the first one (0.15%) if I don't use "$".

What is wrong?


P.S. Nick, th_aft is not a terminator; it's not always there.
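In the example string the last % is followed by " th_aft.", so a pattern that demands % at the very end of the string cannot match. A hedged sketch of one way around this: allow an optional " th_aft." and trailing blanks before the "$" anchor, and pull out the percentage with regexs(1). The second data line (no th_aft) and the name last_pct are made up for illustration.

```stata
clear
input str70 fee_str
"0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
"0.25%-$198(M) 0.125%"
end

* last percentage, optionally followed by " th_aft." and trailing blanks
gen last_pct = regexs(1) if ///
    regexm(fee_str, "([0-9]+\.[0-9]*%)( th_aft\.)?[ ]*$")
list fee_str last_pct
```

If the real strings end in something else (a different suffix, say), the optional group would need adjusting.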

2011/8/27 Nick Cox <[email protected]>:
It is not obvious to me that you need -regexm()- at all.

The text " th_aft" appears to be just a terminator that you don't care
about, so remove it.

replace j = subinstr(j, " th_aft", "", .)

The last element can be separated off and then removed.

gen last = word(j, -1)

replace j = reverse(j)
replace j = subinstr(j, word(j,1) , "", 1)
replace j = reverse(j)

We reverse it in order to avoid removing any identical substring.

Those three lines could be telescoped into one.

Then it looks like an exercise in -subinstr()- and -split-.
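For illustration, the three reverse/remove/reverse lines might be telescoped as below. This is a sketch on one line of Eric's example data; it assumes the terminator carries a trailing period, so " th_aft." is removed rather than " th_aft".

```stata
clear
input str70 j
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
end

* assumption: the terminator includes the final period
replace j = subinstr(j, " th_aft.", "", .)
gen last = word(j, -1)

* reverse, drop the (reversed) last word once, reverse back
replace j = reverse(subinstr(reverse(j), word(reverse(j), 1), "", 1))
list j last
```

Working on the reversed string means only the final occurrence is removed, even if the same percentage appears earlier in the string.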


On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <[email protected]> wrote:

Here's an example...note that I messed with the formatting of the %'s and $'s in my example data a bit to show how flexible the -regex- is in the latter part of the code; however, you'll need to check that there aren't other patterns/symbols in your string that could break my code. There are other ways to approach this, but I think the logic here is easy to follow:

*************! watch for wrapping:

**example data:
inp str70(j)
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
"A: 0.25%-$198(M) 0.12%-$398(M)  0.99%-$300(M) 0.00% th_aft."
"A: 1.0%-$109(M) 0.1% th_aft."
"A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
end

**regexm example == easier to use -split- initially
g example = regexs(0) ///
 if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
drop example

replace j = subinstr(j, "A: ", "", 1)
split j, p("(M) ")

**first, find x10 :
g x10 = ""

tempvar flag
g `flag' = ""
foreach var of varlist j? {
    replace `flag' = "`var'" if ///
        strpos(`var', "th_aft")>0
    replace x10  = subinstr(`var', "th_aft.", "", .) ///
        if `flag' == "`var'"
    replace `var' = "" if strpos(`var', "th_aft")>0
}

**now, create x1-x9 and y1-y9
forval num = 1/9 {
    g x`num' = ""
    g y`num' = ""
    cap replace x`num' = regexs(0) if ///
        regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
        & !mi(j`num') & mi(x`num') //probably overkill
    cap replace y`num' = regexs(0) if ///
        regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
        & !mi(j`num') & mi(y`num')
}
**finally, create y10 == y2:
g y10 = y2

l *1
l *2
l *3

- Eric
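The code above sets y10 to y2, which suits the first example string. If y10 should instead be the last y extracted in each observation, as the original question asks, one possible sketch is to scan y9 down to y1 and keep the first non-empty value. The small dataset here is hypothetical and leaves the dollar signs off the amounts for simplicity.

```stata
clear
input str10(y1 y2 y3)
"197" "397" ""
"198" "398" "300"
"109" ""    ""
end

* y4-y9 exist but are empty, as they would be after the loop above
forvalues num = 4/9 {
    gen y`num' = ""
}

* walk backwards so the highest-numbered non-empty y wins
gen y10 = ""
forvalues num = 9(-1)1 {
    replace y10 = y`num' if mi(y10) & !mi(y`num')
}
list y1-y3 y10
```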

On Aug 26, 2011, at 6:59 PM, KOTa wrote:

I am trying to extract some data from a text variable and, being new to
Stata programming, am struggling to find the right format.

My problem is as follows:

For example, I have a string variable like this: "A: 0.35%- $100(M)
0.30%-$300(M) 0.27% th_aft."

The number of pairs "% - (M)" can be from 1 to 9, and it always ends by "% th_aft"

I have 10 pairs of variables X1 Y1 .... X10 Y10

my goal is to extract all pairs from the string variable and split
them into my separate variables.

in this case the result should be:

X1  = 0.35%
Y1 = $100

X2 = 0.30%
Y2 = $300

X3-X9 = Y3-Y9 = 0

X10 = 0.27%
Y10 = Y2 (i.e. last Y extracted from string)

I am trying to use regexm, but without success. Any suggestions?

*   For searches and help try:

