Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: KOTa <kota.alba@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: regexm
Date: Sat, 27 Aug 2011 15:16:09 +0200

Thanks Eric, Nick.

I used your advice and have almost finished, but I ran into one small problem on the way.

I have the same type of string:

"0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B) 0.10% th_aft."

The number of digits after the dot can be 2 or 3; it's not constant.

I am trying to extract the last % (i.e. 0.10% in this case) using "$", like this:

g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")

or

g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")

and it fails in both cases: the result is empty. It does extract the first one (0.15%) if I don't use "$".

What is wrong?

Thanks

P.S. Nick, th_aft is not a terminator; it's not always there.

2011/8/27 Nick Cox <njcoxstata@gmail.com>:
> It is not obvious to me that you need -regexm()- at all.
>
> The text " th_aft" appears to be just a terminator that you don't care
> about, so remove it.
>
> replace j = subinstr(j, " th_aft", "", .)
>
> The last element can be separated off and then removed.
>
> gen last = word(j, -1)
>
> replace j = reverse(j)
> replace j = subinstr(j, word(j,1) , "", 1)
> replace j = reverse(j)
>
> We reverse it in order to avoid removing any identical substring.
>
> Those three lines could be telescoped into one.
>
> Then it looks like an exercise in -subinstr()- and -split-.
>
> Nick
>
> On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <ebooth@ppri.tamu.edu> wrote:
>> <>
>>
>> Here's an example...note that I messed with the formatting of the %'s and $'s in my example data a bit to show how flexible the -regex- is in the latter part of the code; however, you'll need to check that there aren't other patterns/symbols in your string that could break my code.
>> There are other ways to approach this, but I think the logic here is easy to follow:
>>
>> *************! watch for wrapping:
>>
>> **example data:
>> clear
>> inp str70(j)
>> "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
>> "A: 0.25%-$198(M) 0.12%-$398(M) 0.99%-$300(M) 0.00% th_aft."
>> "A: 1.0%-$109(M) 0.1% th_aft."
>> "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
>> end
>>
>>
>>
>> **regexm example == easier to use -split- initially
>> g example = regexs(0) ///
>>     if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
>> l
>> drop example
>>
>>
>> **split:
>> replace j = subinstr(j, "A: ", "", 1)
>> split j, p("(M) ")
>>
>> **first, find x10 :
>> g x10 = ""
>>
>> tempvar flag
>> g `flag' = ""
>> foreach var of varlist j? {
>>     replace `flag' = "`var'" if ///
>>         strpos(`var', "th_aft")>0
>>     replace x10 = subinstr(`var', "th_aft.", "", .) ///
>>         if `flag' == "`var'"
>>     replace `var' = "" if strpos(`var', "th_aft")>0
>> }
>>
>>
>> **now, create x1-x9 and y1-y9
>> forval num = 1/9 {
>>     g x`num' = ""
>>     g y`num' = ""
>>     cap replace x`num' = regexs(0) if ///
>>         regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
>>         & !mi(j`num') & mi(x`num') //probably overkill
>>     cap replace y`num' = regexs(0) if ///
>>         regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
>>         & !mi(j`num') & mi(y`num')
>> }
>> **finally, create y10 == y2:
>> g y10 = y2
>>
>>
>> ****list:
>> l *1
>> l *2
>> l *3
>>
>> *************!
>> - Eric
>>
>> On Aug 26, 2011, at 6:59 PM, KOTa wrote:
>
>>> I am trying to extract some data from text variable and being new to
>>> stata programming struggling with finding right format.
>>>
>>> my problem is as following:
>>>
>>> for example i have string variable as following: "A: 0.35%-$100(M)
>>> 0.30%-$300(M) 0.27% th_aft."
>>>
>>> number of pairs "% - (M)" can be from 1 to 9 and it always ends by "% th_aft"
>>>
>>> I have 10 pairs of variables X1 Y1 .... X10 Y10
>>>
>>> my goal is to extract all pairs from the string variable and split
>>> them into my separate variables.
>>>
>>> in this case the result should be:
>>>
>>> X1 = 0.35%
>>> Y1 = $100
>>>
>>> X2 = 0.30%
>>> Y2 = $300
>>>
>>> X3-X9 = y3-Y9 = 0
>>>
>>> X10 = 0.27%
>>> Y10 = Y2 (i.e. last Y extracted from string)
>>>
>>> I am trying to use regexm but unsuccessfully, Any suggestions?
>>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
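
A note on the "$" anchor asked about in this message: in -regexm()-, "$" matches only at the very end of the string, and the sample string ends in " th_aft." rather than in "%", so neither anchored pattern can match. What follows is a minimal sketch of one possible workaround, not a tested solution for the real data. It assumes that the only text that can follow the last percentage is the optional "th_aft." plus blanks (the post does not say what else may appear). The second data line and the variable names s and last_pct are made up for illustration.

```stata
* Sketch only. Row 1 is the string from the post; row 2 is a made-up
* string with three decimals and no "th_aft." at the end.
clear
input str70 fee_str
"0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B) 0.10% th_aft."
"0.15%-$1(B) 0.125%"
end

* Drop the optional terminator and any leading/trailing blanks, so that
* the last percentage really is the end of the string.
gen s = trim(subinstr(fee_str, "th_aft.", "", .))

* Now "$" can anchor the match at the end of s.
gen last_pct = regexs(0) if regexm(s, "[0-9]+\.?[0-9]*%$")

list fee_str last_pct
```

With these two rows, last_pct should hold 0.10% and 0.125%. If other text can trail the last percentage, Nick's -word(j, -1)- approach after cleaning the string may be the safer route.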

**Follow-Ups**:
- **Re: st: regexm** *From:* Nick Cox <njcoxstata@gmail.com>

**References**:
- **st: regexm** *From:* KOTa <kota.alba@gmail.com>
- **Re: st: regexm** *From:* Eric Booth <ebooth@ppri.tamu.edu>
- **Re: st: regexm** *From:* Nick Cox <njcoxstata@gmail.com>
