
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: regexm

From	Nick Cox <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: regexm
Date	Sat, 27 Aug 2011 15:33:44 +0100

Strings longer than 244 characters cannot be read into variables. You could read them into Mata.
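
A minimal sketch of the Mata route, assuming the long strings sit one per line in a plain text file. The file name longstrings_demo.txt is hypothetical, and the sketch writes that file itself so that it runs on its own. (Stata 13 and later add the strL storage type, which lifts the limit for variables; the sketch targets the situation described here.)

```stata
mata:
// hypothetical scratch file; created here so the example is self-contained
fn = "longstrings_demo.txt"
if (fileexists(fn)) unlink(fn)

// build a 300-character string, longer than the 244-character variable limit
s = ""
for (i = 1; i <= 300; i++) {
    s = s + "x"
}

// write it out as a single line
fh = fopen(fn, "w")
fput(fh, s)
fclose(fh)

// cat() returns the file as a string column vector, one element per line
lines = cat(fn)
len = strlen(lines[1])
len
end
```

Once the text is in a Mata string vector it can be cut into shorter pieces there before anything is stored in Stata variables.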


As said, do look at -moss-.

Nick

On 27 Aug 2011, at 15:22, KOTa <[email protected]> wrote:

simpler in a logistical way, i.e. i tried to do the whole thing without
creating additional variables (that split creates) in the middle.

another question, if you know. also about strings. when i import file
to stata (from excel, for example) i have some very long strings, that
stata cuts to 244 chars.

is there any trick to go around it? except making them shorter before
importing :)

thank you

2011/8/27 Nick Cox <[email protected]>:

Better in what sense? Quicker to get a solution? Simpler? Other criteria?


I don't know a way of counting more than 9 matches directly. I think
you would need, if you continue to follow that path, to loop over a
string repeatedly finding new instances and counting.
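
The looping idea can be sketched as follows. This is only an illustration under stated assumptions: the toy data, the variable names work and n, and the percentage pattern are made up for the example, and dollar amounts are left out of the data to keep it simple. Each pass counts one match per observation and then strips that match with -regexr()-, until no observation has a match left.

```stata
* toy data (assumed): only the percentage tokens matter for the count
clear
input str60 j
"0.35%-197(M) 0.30%-397(M) 0.27% th_aft."
"1.0%-109(M) 0.1% th_aft."
"no fees listed"
end

gen work = j          // scratch copy that is consumed match by match
gen n = 0             // running count of matches per observation

* keep stripping the first remaining match until no observation has one left
quietly count if regexm(work, "[0-9]+\.?[0-9]*%")
while r(N) > 0 {
    quietly replace n = n + 1 if regexm(work, "[0-9]+\.?[0-9]*%")
    quietly replace work = regexr(work, "[0-9]+\.?[0-9]*%", "")
    quietly count if regexm(work, "[0-9]+\.?[0-9]*%")
}
drop work
list j n
```

Because the loop runs until the pattern is exhausted, it is not tied to the nine subexpressions that -regexs()- can return.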

See also -moss- from SSC.

Nick

On Sat, Aug 27, 2011 at 2:52 PM, KOTa <[email protected]> wrote:

yes, i do work now with split, just thought with regex it will be better.

anyway, is there a way to find out how many expressions regexm finds?

1. what i mean is i can access the 1st, 2nd etc. up to 9 with regexs,
but if i don't know how many there are -> i don't know which one is
last.
2. what if more than 9 expressions are found? according to the manual regexs
can only take parameters 0-9.


thanks

2011/8/27 Nick Cox <[email protected]>:

Well, you did say "it always ends by "% th_aft".

I will continue as I started.

If you first blank out stuff you don't need then you can just use
-split- to separate out elements. If you parse on spaces then it is
immaterial whether you have 2 or 3 digits before; you retrieve the number
either way.

No need for regex demonstrated.
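
To make the point concrete, here is a small sketch. The toy data and the names clean, last_pct and last_word are assumptions for illustration, and the dollar signs are left out of the data. The "$" anchor only matches at the very end of the string, so a pattern ending in "%$" cannot match while " th_aft." still follows the last percentage. Removing the terminator where it is present makes either approach work.

```stata
* toy data (assumed): one string with the terminator, one without
clear
input str60 fee_str
"0.15%-1(B) 0.14%-2(B) 0.12%-2(B) 0.10% th_aft."
"0.15%-1(B) 0.125%"
end

* drop the optional terminator so the last percentage ends the string
gen clean = trim(subinstr(fee_str, "th_aft.", "", .))

* with nothing after the final percentage, the anchored pattern matches
gen last_pct = regexs(0) if regexm(clean, "[0-9]+\.?[0-9]*%$")

* same result without a regular expression: the last space-delimited word
gen last_word = word(clean, -1)

list clean last_pct last_word
```

The -word()- version does not depend on how many digits follow the decimal point.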

Nick

On Sat, Aug 27, 2011 at 2:16 PM, KOTa <[email protected]> wrote:

thanks Eric, Nick. I used your advice and am almost finished.

but i encountered one small problem on the way.

i have the same type of string - "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B) 0.10% th_aft." - the number of digits after the dot can be 2 or 3, it's not constant

and i am trying to extract the last % (i.e. 0.10% in this case) using "$" like this:

g example = regexs(0) if regexm(fee_str, "[0-9]+\.[0-9]*[%]$")

or

g example = regexs(0) if regexm(fee_str, "[0-9]+\.[0-9]*[%]+$")

and it fails in both cases.

the result is empty

it does extract the first one (0.15%) if i dont use "$"

what is wrong?

thanks

p.s. Nick, th_aft is not a terminator, it's not always there


2011/8/27 Nick Cox <[email protected]>:

It is not obvious to me that you need -regexm()- at all.

The text " th_aft" appears to be just a terminator that you don't care
about, so remove it.

replace j = subinstr(j, " th_aft", "", .)

The last element can be separated off and then removed.

gen last = word(j, -1)

replace j = reverse(j)
replace j = subinstr(j, word(j,1) , "", 1)
replace j = reverse(j)

We reverse it in order to avoid removing any identical substring.

Those three lines could be telescoped into one.

Then it looks like an exercise in -subinstr()- and -split-.
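
A sketch of the telescoped version, under stated assumptions: the toy data are made up, dollar signs are left out, and the terminator is taken to include its trailing period (" th_aft.") so that no stray "." is left on the last element.

```stata
* toy data (assumed)
clear
input str60 j
"0.35%-197(M) 0.30%-397(M) 0.27% th_aft."
"1.0%-109(M) 0.1% th_aft."
end

* remove the terminator, including its trailing period
replace j = subinstr(j, " th_aft.", "", .)

* separate off the last element
gen last = word(j, -1)

* the three reverse/subinstr/reverse lines telescoped into one:
* reverse, drop the first occurrence of the (reversed) last word, reverse back
replace j = reverse(subinstr(reverse(j), word(reverse(j), 1), "", 1))

* tidy the space left behind by the removed element
replace j = trim(j)

list j last
```

Working on the reversed string means that only the final occurrence is removed, even when an identical element appears earlier in the string.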

Nick

On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <[email protected]> wrote:

<>

Here's an example. Note that I messed with the formatting of the %'s and $'s in my example data a bit to show how flexible the -regex- is in the latter part of the code; however, you'll need to check that there aren't other patterns/symbols in your string that could break my code. There are other ways to approach this, but I think the logic here is easy to follow:


*************! watch for wrapping:

**example data:
clear
inp str70(j)
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
"A: 0.25%-$198(M) 0.12%-$398(M)  0.99%-$300(M) 0.00% th_aft."
"A: 1.0%-$109(M) 0.1% th_aft."
"A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
end



**regexm example == easier to use -split- initially
g example = regexs(0) ///
 if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
l
drop example


**split:
replace j = subinstr(j, "A: ", "", 1)
split j, p("(M) ")

**first, find x10 :
g x10 = ""

tempvar flag
g `flag' = ""
foreach var of varlist j? {
    replace `flag' = "`var'" if ///
        strpos(`var', "th_aft")>0
    replace x10 = subinstr(`var', "th_aft.", "", .) ///
        if `flag' == "`var'"
    replace `var' = "" if strpos(`var', "th_aft")>0
}


**now, create x1-x9 and y1-y9
forval num = 1/9 {
 g x`num' = ""
 g y`num' = ""
 cap replace x`num' = regexs(0) if ///
       regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
       & !mi(j`num') & mi(x`num') //probably overkill
 cap replace y`num' = regexs(0) if ///
       regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
       & !mi(j`num') & mi(y`num')
       }
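**finally, create y10 == last non-missing y (y2 in the original example):
 g y10 = ""
 forval num = 1/9 {
  replace y10 = y`num' if !mi(y`num')
 }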


****list:
l *1
l *2
l *3

*************!
- Eric

On Aug 26, 2011, at 6:59 PM, KOTa wrote:

I am trying to extract some data from a text variable and, being new to
Stata programming, am struggling to find the right format.

my problem is as follows:
for example, i have a string variable like this: "A: 0.35%-$100(M)
0.30%-$300(M) 0.27% th_aft."
the number of pairs "% - (M)" can be from 1 to 9 and it always ends with "% th_aft"
I have 10 pairs of variables X1 Y1 .... X10 Y10
my goal is to extract all pairs from the string variable and split
them into my separate variables.

in this case the result should be:

X1  = 0.35%
Y1 = $100

X2 = 0.30%
Y2 = $300

X3-X9 = Y3-Y9 = 0

X10 = 0.27%
Y10 = Y2 (i.e. last Y extracted from string)

I am trying to use regexm but unsuccessfully. Any suggestions?


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



