Re: st: regexm


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: regexm
Date   Sat, 27 Aug 2011 08:43:18 +0100

It is not obvious to me that you need -regexm()- at all.

The text " th_aft" appears to be just a terminator that you don't care
about, so remove it.

replace j = subinstr(j, " th_aft", "", .)

The last element can be separated off and then removed.

gen last = word(j, -1)

replace j = reverse(j)
replace j = subinstr(j, word(j,1) , "", 1)
replace j = reverse(j)

We reverse the string so that -subinstr()- removes the last word itself,
not an identical substring that happens to occur earlier in the string.
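
To see why the reversal matters, here is a minimal sketch on one made-up
observation (not from the data in this thread) in which the last word
also occurs at the start of the string. The variable names naive and
viarev are hypothetical, and char(36) is used to build the "$" only to
sidestep global macro expansion.

```stata
* Hypothetical single observation; char(36) is the "$" character.
clear
set obs 1
gen str40 j = "0.27%-" + char(36) + "100(M) 0.27%"

* Without reversing: -subinstr()- deletes the FIRST "0.27%", the wrong one.
gen naive = subinstr(j, word(j, -1), "", 1)

* Reverse, remove the first word of the reversed string, reverse back.
gen viarev = reverse(j)
replace viarev = subinstr(viarev, word(viarev, 1), "", 1)
replace viarev = reverse(viarev)

list naive viarev
```

The reversed route leaves a trailing blank where the last word was, so a
-trim()- afterwards may be wanted.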

Those three lines could be telescoped into one.
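
One possible telescoped form, as a sketch only: nest the three calls so
that the reversed string is never stored. The test observation is made
up, and the final -trim()- is my assumption that the leftover trailing
blank is not wanted.

```stata
* Hypothetical single observation; char(36) is the "$" character.
clear
set obs 1
gen str40 j = "A: 0.35%-" + char(36) + "197(M) 0.27%"

* Keep the last word before it is removed.
gen last = word(j, -1)

* The three -replace- lines in one: reverse, drop first word, reverse back.
replace j = reverse(subinstr(reverse(j), word(reverse(j), 1), "", 1))

* Assumption: the blank left behind at the end is not wanted.
replace j = trim(j)

list j last
```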

Then it looks like an exercise in -subinstr()- and -split-.
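
A rough sketch of how that exercise might go, under stated assumptions:
the two test observations are modelled on Eric's example data, the
terminator is taken to include the final period as it does there, the
pieces are assumed to be separated by single blanks, and the stub
piece is a hypothetical name. Treat it as one way of doing it, not the
only one.

```stata
* Two hypothetical observations; char(36) is the "$" character.
clear
set obs 2
gen str60 j = "A: 0.35%-" + char(36) + "197(M) 0.30%-" ///
    + char(36) + "397(M) 0.27% th_aft." in 1
replace j = "A: 1.0%-" + char(36) + "109(M) 0.1% th_aft." in 2

* Strip the terminator (assumed to end with a period) and the prefix.
replace j = subinstr(j, " th_aft.", "", .)
replace j = subinstr(j, "A: ", "", 1)

* Separate off the last element as x10, then remove it from j.
gen x10 = word(j, -1)
replace j = trim(reverse(subinstr(reverse(j), word(reverse(j), 1), "", 1)))

* Reduce what is left to blank-separated words: x1 y1 x2 y2 ...
replace j = subinstr(j, "(M)", "", .)
replace j = subinstr(j, "-", " ", .)
split j, gen(piece)

* Odd-numbered pieces are percents (x), even-numbered are dollars (y).
local k = 0
foreach v of varlist piece* {
    local ++k
    local i = ceil(`k'/2)
    if mod(`k', 2) {
        rename `v' x`i'
    }
    else {
        rename `v' y`i'
    }
}

* y10 is the last nonmissing y in each observation.
gen y10 = ""
foreach v of varlist y? {
    replace y10 = `v' if `v' != ""
}

list x* y*
```

Pairs that do not occur in an observation are left as empty strings here
rather than 0.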

Nick

On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <ebooth@ppri.tamu.edu> wrote:
> <>
>
> Here's an example...note that I messed with the formatting of the %'s and $'s in my example data a bit to show how flexible the -regex- is in the latter part of the code; however, you'll need to check that there aren't other patterns/symbols in your string that could break my code.
>  There are other ways to approach this, but I think the logic here is easy to follow:
>
> *************! watch for wrapping:
>
> **example data:
> clear
> inp str70(j)
> "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
> "A: 0.25%-$198(M) 0.12%-$398(M)  0.99%-$300(M) 0.00% th_aft."
> "A: 1.0%-$109(M) 0.1% th_aft."
> "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
> end
>
>
>
> **regexm example == easier to use -split- initially
> g example = regexs(0) ///
>  if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
> l
> drop example
>
>
> **split:
> replace j = subinstr(j, "A: ", "", 1)
> split j, p("(M) ")
>
> **first, find x10 :
> g x10 = ""
>
> tempvar flag
> g `flag' = ""
> foreach var of varlist j? {
> replace `flag' = "`var'" if ///
>        strpos(`var', "th_aft")>0
> replace x10  = subinstr(`var', "th_aft.", "", .) ///
>         if `flag' == "`var'"
> replace `var' = "" if strpos(`var', "th_aft")>0
>        }
>
>
> **now, create x1-x9 and y1-y9
> forval num = 1/9 {
>  g x`num' = ""
>  g y`num' = ""
>  cap replace x`num' = regexs(0) if ///
>        regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
>        & !mi(j`num') & mi(x`num') //probably overkill
>  cap replace y`num' = regexs(0) if ///
>        regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
>        & !mi(j`num') & mi(y`num')
>        }
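> **finally, create y10 == last nonmissing y (y2 in the original example):
>  g y10 = ""
>  forval num = 1/9 {
>   replace y10 = y`num' if !mi(y`num')
>  }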
>
>
> ****list:
> l *1
> l *2
> l *3
>
> *************!
> - Eric
>
> On Aug 26, 2011, at 6:59 PM, KOTa wrote:

>> I am trying to extract some data from a text variable and, being new to
>> Stata programming, am struggling to find the right format.
>>
>> My problem is as follows:
>>
>> For example, I have a string variable like this: "A: 0.35%-$100(M)
>> 0.30%-$300(M) 0.27% th_aft."
>>
>> The number of pairs "% - (M)" can be from 1 to 9, and the string always ends with "% th_aft"
>>
>> I have 10 pairs of variables X1 Y1 .... X10 Y10
>>
>> my goal is to extract all pairs from the string variable and split
>> them into my separate variables.
>>
>> in this case the result should be:
>>
>> X1  = 0.35%
>> Y1 = $100
>>
>> X2 = 0.30%
>> Y2 = $300
>>
>> X3-X9 = Y3-Y9 = 0
>>
>> X10 = 0.27%
>> Y10 = Y2 (i.e. last Y extracted from string)
>>
>> I am trying to use regexm, but unsuccessfully. Any suggestions?
>>
