Re: st: regexm

From	Nick Cox <[email protected]>
To	[email protected]
Subject	Re: st: regexm
Date	Sat, 27 Aug 2011 14:31:55 +0100

Well, you did say that it always ends by "% th_aft".

I will continue as I started.

If you first blank out the stuff you don't need, then you can just use
-split- to separate out the elements. If you parse on spaces, then it is
immaterial whether you have 2 or 3 digits after the decimal point: you
retrieve the number either way.

No need for regex demonstrated.
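
A minimal sketch of that approach, not tested against your real data. It borrows the example data from Eric's message quoted below; the names -work-, -last- and -viaregex- are made up for illustration. It also shows why anchoring with "$" found nothing: "$" matches only at the very end of the string, and in your strings the last "%" is followed by " th_aft." whenever that text is present. Once that text is blanked out, the anchored regex and -word()- agree.

```stata
* example data copied from Eric Booth's message quoted below
clear
input str70 j
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
"A: 0.25%-$198(M) 0.12%-$398(M)  0.99%-$300(M) 0.00% th_aft."
"A: 1.0%-$109(M) 0.1% th_aft."
"A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
end

* blank out what is not needed; nothing changes where " th_aft." is absent
gen work = subinstr(j, " th_aft.", "", .)
replace work = subinstr(work, "A: ", "", 1)

* collapse repeated blanks so that parsing on spaces is unambiguous
replace work = itrim(trim(work))

* the last percentage is now the last word, however many digits it has
gen last = word(work, -1)

* for comparison only: the "$" anchor works once the terminator is gone
gen viaregex = regexs(0) if regexm(work, "[0-9]+\.?[0-9]*%$")

* parse on spaces to get the individual elements as work1, work2, ...
split work

list last viaregex work?
```

Each work# variable then holds one "percent-dollar" element, which can be taken apart with -subinstr()- or a further -split-.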

Nick

On Sat, Aug 27, 2011 at 2:16 PM, KOTa <[email protected]> wrote:
> Thanks Eric and Nick. I used your advice and have almost finished,
>
> but I ran into one small problem on the way.
>
> I have the same type of string, "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B)
> 0.10% th_aft.", where the number of digits after the dot can be 2 or 3;
> it's not constant.
>
> I am trying to extract the last percentage (i.e. 0.10% in this case)
> using "$", like this:
>
> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")
>
> or
>
> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")
>
> and it fails in both cases.
>
> The result is empty.
>
> It does extract the first one (0.15%) if I don't use "$".
>
> What is wrong?
>
> Thanks
>
> P.S. Nick, th_aft is not a terminator; it's not always there.
>
>
> 2011/8/27 Nick Cox <[email protected]>:
>> It is not obvious to me that you need -regexm()- at all.
>>
>> The text " th_aft." appears to be just a terminator that you don't care
>> about, so remove it.
>>
>> replace j = subinstr(j, " th_aft.", "", .)
>>
>> The last element can be separated off and then removed.
>>
>> gen last = word(j, -1)
>>
>> replace j = reverse(j)
>> replace j = subinstr(j, word(j,1) , "", 1)
>> replace j = reverse(j)
>>
>> We reverse it in order to avoid removing any identical substring.
>>
>> Those three lines could be telescoped into one.
>>
>> Then it looks like an exercise in -subinstr()- and -split-.
>>
>> Nick
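
For what it is worth, the three -reverse()- lines in the message above could be telescoped along these lines. This is only a sketch: it uses one of the example strings from Eric's message below, and it strips " th_aft." together with its trailing period, which is an assumption about what is wanted.

```stata
* one example string, copied from Eric Booth's example data
clear
input str70 j
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
end

* remove the terminator (with its period) and keep the last element
replace j = subinstr(j, " th_aft.", "", .)
gen last = word(j, -1)

* reverse, delete the first occurrence of the reversed last word, reverse back
replace j = reverse(subinstr(reverse(j), word(reverse(j), 1), "", 1))

list
```

Working on the reversed string means that only the final occurrence is deleted, even if an identical substring appears earlier.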
>>
>> On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <[email protected]> wrote:
>>> <>
>>>
>>> Here's an example...note that I messed with the formatting of the %'s and $'s in my example data a bit to show how flexible the -regex- is in the latter part of the code; however, you'll need to check that there aren't other patterns/symbols in your string that could break my code.
>>>  There are other ways to approach this, but I think the logic here is easy to follow:
>>>
>>> *************! watch for wrapping:
>>>
>>> **example data:
>>> clear
>>> inp str70(j)
>>> "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
>>> "A: 0.25%-$198(M) 0.12%-$398(M)  0.99%-$300(M) 0.00% th_aft."
>>> "A: 1.0%-$109(M) 0.1% th_aft."
>>> "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
>>> end
>>>
>>>
>>>
>>> **regexm example == easier to use -split- initially
>>> g example = regexs(0) ///
>>>  if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
>>> l
>>> drop example
>>>
>>>
>>> **split:
>>> replace j = subinstr(j, "A: ", "", 1)
>>> split j, p("(M) ")
>>>
>>> **first, find x10 :
>>> g x10 = ""
>>>
>>> tempvar flag
>>> g `flag' = ""
>>> foreach var of varlist j? {
>>> replace `flag' = "`var'" if ///
>>>        strpos(`var', "th_aft")>0
>>> replace x10  = subinstr(`var', "th_aft.", "", .) ///
>>>         if `flag' == "`var'"
>>> replace `var' = "" if strpos(`var', "th_aft")>0
>>>        }
>>>
>>>
>>> **now, create x1-x9 and y1-y9
>>> forval num = 1/9 {
>>>  g x`num' = ""
>>>  g y`num' = ""
>>>  cap replace x`num' = regexs(0) if ///
>>>        regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
>>>        & !mi(j`num') & mi(x`num') //probably overkill
>>>  cap replace y`num' = regexs(0) if ///
>>>        regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
>>>        & !mi(j`num') & mi(y`num')
>>>        }
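>>> **finally, create y10 == last nonmissing y (y2 in the first row):
>>>  g y10 = ""
>>>  forval num = 1/9 {
>>>   replace y10 = y`num' if !mi(y`num')
>>>  }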
>>>
>>>
>>> ****list:
>>> l *1
>>> l *2
>>> l *3
>>>
>>> *************!
>>> - Eric
>>>
>>> On Aug 26, 2011, at 6:59 PM, KOTa wrote:
>>
>>>> I am trying to extract some data from a text variable and, being new to
>>>> Stata programming, I am struggling to find the right format.
>>>>
>>>> My problem is as follows.
>>>>
>>>> For example, I have a string variable like this: "A: 0.35%-$100(M)
>>>> 0.30%-$300(M) 0.27% th_aft."
>>>>
>>>> The number of pairs "% - (M)" can be from 1 to 9 and it always ends by "% th_aft"
>>>>
>>>> I have 10 pairs of variables X1 Y1 .... X10 Y10
>>>>
>>>> my goal is to extract all pairs from the string variable and split
>>>> them into my separate variables.
>>>>
>>>> in this case the result should be:
>>>>
>>>> X1  = 0.35%
>>>> Y1 = $100
>>>>
>>>> X2 = 0.30%
>>>> Y2 = $300
>>>>
>>>> X3-X9 = Y3-Y9 = 0
>>>>
>>>> X10 = 0.27%
>>>> Y10 = Y2 (i.e. last Y extracted from string)
>>>>
>>>> I am trying to use regexm, but unsuccessfully. Any suggestions?
>>>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>


Follow-Ups:
- Re: st: regexm
  - From: KOTa <[email protected]>

References:
- st: regexm
  - From: KOTa <[email protected]>
- Re: st: regexm
  - From: Eric Booth <[email protected]>
- Re: st: regexm
  - From: Nick Cox <[email protected]>
- Re: st: regexm
  - From: KOTa <[email protected]>
