From: KOTa <kota.alba@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: regexm
Date: Sat, 27 Aug 2011 16:22:47 +0200

Simpler in a logistical sense, i.e. I tried to do the whole thing without creating the additional variables (that -split- creates) in the middle.

Another question, if you know, also about strings: when I import a file into Stata (from Excel, for example) I have some very long strings that Stata cuts to 244 characters. Is there any trick to get around it, other than making them shorter before importing? :)

Thank you

2011/8/27 Nick Cox <njcoxstata@gmail.com>:
> Better in what sense? Quicker to get a solution? Simpler? Other criteria?
>
> I don't know a way of counting more than 9 matches directly. I think
> you would need, if you continue to follow that path, to loop over a
> string repeatedly finding new instances and counting.
>
> See also -moss- from SSC.
>
> Nick
>
> On Sat, Aug 27, 2011 at 2:52 PM, KOTa <kota.alba@gmail.com> wrote:
>> yes, i do work now with split, just thought with regex it will be better.
>>
>> anyway, is there a way to find out how many expressions regexm finds?
>> 1. what i mean is i can access the 1st 2nd etc up to 9 with regexs,
>> but if i dont know how many there are -> i dont know which one is
>> last.
>> 2. what if more than 9 expressions found? according to manual regexs
>> only can have 0-9 parameters.
>>
>> thanks
>>
>> 2011/8/27 Nick Cox <njcoxstata@gmail.com>:
>>> Well, you did say "it always ends by "% th_aft".
>>>
>>> I will continue as I started.
>>>
>>> If you first blank out stuff you don't need then you can just use
>>> -split- to separate out elements. If you parse on spaces then it is
>>> immaterial when you have 2 or 3 digits before, you retrieve the number
>>> either way.
>>>
>>> No need for regex demonstrated.
>>>
>>> Nick
>>>
>>> On Sat, Aug 27, 2011 at 2:16 PM, KOTa <kota.alba@gmail.com> wrote:
>>>> thanks Eric, Nick I used your advices and almost finished.
>>>>
>>>> but encountered one small problem on the way.
>>>>
>>>> i have the same type of string - "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B)
>>>> 0.10% th_aft." - number of digits after the dot can be 2 or 3, it's
>>>> not constant
>>>>
>>>> and i am trying to extract the last % (i.e. 0.10% in this case) using
>>>> "$" like this:
>>>>
>>>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")
>>>> or
>>>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")
>>>> and it fails in both cases.
>>>>
>>>> the result is empty
>>>>
>>>> it does extract the first one (0.15%) if i dont use "$"
>>>>
>>>> what is wrong?
>>>>
>>>> thanks
>>>>
>>>> p.s. Nick, th_aft is not a terminator, its not always there
>>>>
>>>> 2011/8/27 Nick Cox <njcoxstata@gmail.com>:
>>>>> It is not obvious to me that you need -regexm()- at all.
>>>>>
>>>>> The text " th_aft" appears to be just a terminator that you don't care
>>>>> about, so remove it.
>>>>>
>>>>> replace j = subinstr(j, " th_aft", "", .)
>>>>>
>>>>> The last element can be separated off and then removed.
>>>>>
>>>>> gen last = word(j, -1)
>>>>>
>>>>> replace j = reverse(j)
>>>>> replace j = subinstr(j, word(j,1) , "", 1)
>>>>> replace j = reverse(j)
>>>>>
>>>>> We reverse it in order to avoid removing any identical substring.
>>>>>
>>>>> Those three lines could be telescoped into one.
>>>>>
>>>>> Then it looks like an exercise in -subinstr()- and -split-.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <ebooth@ppri.tamu.edu> wrote:
>>>>>> <>
>>>>>>
>>>>>> Here's an example...note that I messed with the formatting of the
>>>>>> %'s and $'s in my example data a bit to show how flexible the
>>>>>> -regex- is in the latter part of the code; however, you'll need to
>>>>>> check that there aren't other patterns/symbols in your string that
>>>>>> could break my code.
>>>>>> There are other ways to approach this, but I think the logic here
>>>>>> is easy to follow:
>>>>>>
>>>>>> *************! watch for wrapping:
>>>>>>
>>>>>> **example data:
>>>>>> clear
>>>>>> inp str70(j)
>>>>>> "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
>>>>>> "A: 0.25%-$198(M) 0.12%-$398(M) 0.99%-$300(M) 0.00% th_aft."
>>>>>> "A: 1.0%-$109(M) 0.1% th_aft."
>>>>>> "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
>>>>>> end
>>>>>>
>>>>>> **regexm example == easier to use -split- initially
>>>>>> g example = regexs(0) ///
>>>>>>     if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
>>>>>> l
>>>>>> drop example
>>>>>>
>>>>>> **split:
>>>>>> replace j = subinstr(j, "A: ", "", 1)
>>>>>> split j, p("(M) ")
>>>>>>
>>>>>> **first, find x10 :
>>>>>> g x10 = ""
>>>>>>
>>>>>> tempvar flag
>>>>>> g `flag' = ""
>>>>>> foreach var of varlist j? {
>>>>>>     replace `flag' = "`var'" if ///
>>>>>>         strpos(`var', "th_aft")>0
>>>>>>     replace x10 = subinstr(`var', "th_aft.", "", .) ///
>>>>>>         if `flag' == "`var'"
>>>>>>     replace `var' = "" if strpos(`var', "th_aft")>0
>>>>>> }
>>>>>>
>>>>>> **now, create x1-x9 and y1-y9
>>>>>> forval num = 1/9 {
>>>>>>     g x`num' = ""
>>>>>>     g y`num' = ""
>>>>>>     cap replace x`num' = regexs(0) if ///
>>>>>>         regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
>>>>>>         & !mi(j`num') & mi(x`num') //probably overkill
>>>>>>     cap replace y`num' = regexs(0) if ///
>>>>>>         regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
>>>>>>         & !mi(j`num') & mi(y`num')
>>>>>> }
>>>>>> **finally, create y10 == y2:
>>>>>> g y10 = y2
>>>>>>
>>>>>> ****list:
>>>>>> l *1
>>>>>> l *2
>>>>>> l *3
>>>>>>
>>>>>> *************!
>>>>>> - Eric
>>>>>>
>>>>>> On Aug 26, 2011, at 6:59 PM, KOTa wrote:
>>>>>>
>>>>>>> I am trying to extract some data from text variable and being new to
>>>>>>> stata programming struggling with finding right format.
>>>>>>>
>>>>>>> my problem is as following:
>>>>>>>
>>>>>>> for example i have string variable as following: "A: 0.35%-$100(M)
>>>>>>> 0.30%-$300(M) 0.27% th_aft."
>>>>>>>
>>>>>>> number of pairs "% - (M)" can be from 1 to 9 and it always ends by "% th_aft"
>>>>>>>
>>>>>>> I have 10 pairs of variables X1 Y1 .... X10 Y10
>>>>>>>
>>>>>>> my goal is to extract all pairs from the string variable and split
>>>>>>> them into my separate variables.
>>>>>>>
>>>>>>> in this case the result should be:
>>>>>>>
>>>>>>> X1 = 0.35%
>>>>>>> Y1 = $100
>>>>>>>
>>>>>>> X2 = 0.30%
>>>>>>> Y2 = $300
>>>>>>>
>>>>>>> X3-X9 = Y3-Y9 = 0
>>>>>>>
>>>>>>> X10 = 0.27%
>>>>>>> Y10 = Y2 (i.e. last Y extracted from string)
>>>>>>>
>>>>>>> I am trying to use regexm but unsuccessfully, Any suggestions?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
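
A note on the two regex questions raised in the thread, offered as a minimal sketch and not as anyone's posted solution. In the quoted example the string ends in " th_aft.", so a pattern anchored as "%$" cannot match there; making that trailer optional before the "$" is one way around it. Also, regexs(1) to regexs(9) return the parenthesized subexpressions of a single match, not successive matches, so the number of percentages has to be counted some other way; comparing string lengths before and after removing "%" is one option. The example data, the trailer spelling, and the new variable names n_pct and last_pct are assumptions for illustration; fee_str is the variable name used in the thread. The count is only right if "%" appears nowhere else in the string.

```stata
* Hypothetical example data, modelled on the strings quoted in the thread
clear
set obs 3
gen str80 fee_str = ""
* \$ stops Stata from expanding $1, $2, ... as global macros
replace fee_str = "0.15%-\$1(B) 0.14%-\$2(B) 0.12%-\$2(B) 0.10% th_aft." in 1
replace fee_str = "1.0%-\$109(M) 0.1% th_aft." in 2
replace fee_str = "0.35%-\$100(M) 0.275%" in 3

* Count the percentages without regex: the drop in length after
* deleting every "%" equals the number of "%" characters
gen n_pct = length(fee_str) - length(subinstr(fee_str, "%", "", .))

* Last percentage: allow an optional " th_aft." between "%" and the
* end of the string; regexs(1) is the first parenthesized group
gen last_pct = regexs(1) ///
    if regexm(fee_str, "([0-9]+\.[0-9]*%)( th_aft\.)?$")

list fee_str n_pct last_pct
```

For the first observation this should give n_pct equal to 4 and last_pct equal to "0.10%". With n_pct in hand, the number of variables produced by -split- is known in advance, which is the piece of information the "which one is last" question was missing.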
