Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: regexm
Date: Sat, 27 Aug 2011 15:33:44 +0100

As said, do look at -moss-.

Nick

On 27 Aug 2011, at 15:22, KOTa <kota.alba@gmail.com> wrote:

Simpler in a logistics way, i.e. I tried to do the whole thing without creating the additional variables (that split creates) in the middle.

Another question, if you know, also about strings. When I import a file to Stata (from Excel, for example) I have some very long strings that Stata cuts to 244 chars. Is there any trick to get around it, except making them shorter before importing? :)

Thank you

2011/8/27 Nick Cox <njcoxstata@gmail.com>:

Better in what sense? Quicker to get a solution? Simpler? Other criteria?

I don't know a way of counting more than 9 matches directly. I think you would need, if you continue to follow that path, to loop over a string repeatedly finding new instances and counting.

See also -moss- from SSC.

Nick

On Sat, Aug 27, 2011 at 2:52 PM, KOTa <kota.alba@gmail.com> wrote:

Yes, I do work now with split, I just thought that with regex it would be better.

Anyway, is there a way to find out how many expressions regexm finds?

1. What I mean is that I can access the 1st, 2nd etc. up to the 9th with regexs, but if I don't know how many there are, I don't know which one is last.
2. What if more than 9 expressions are found? According to the manual, regexs can only have 0-9 parameters.

Thanks

2011/8/27 Nick Cox <njcoxstata@gmail.com>:

Well, you did say it always ends by "% th_aft". I will continue as I started. If you first blank out stuff you don't need then you can just use -split- to separate out elements. If you parse on spaces then it is immaterial whether you have 2 or 3 digits before; you retrieve the number either way. No need for regex demonstrated.

Nick

On Sat, Aug 27, 2011 at 2:16 PM, KOTa <kota.alba@gmail.com> wrote:

Thanks Eric, Nick.

I used your advice and have almost finished, but I encountered one small problem on the way.

I have the same type of string, "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B) 0.10% th_aft." (the number of digits after the dot can be 2 or 3, it's not constant), and I am trying to extract the last % (i.e. 0.10% in this case) using "$" like this:

    g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")

or

    g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")

and it fails in both cases. The result is empty. It does extract the first one (0.15%) if I don't use "$".

What is wrong?

Thanks

P.S. Nick, th_aft is not a terminator, it's not always there.

2011/8/27 Nick Cox <njcoxstata@gmail.com>:

It is not obvious to me that you need -regexm()- at all.

The text " th_aft" appears to be just a terminator that you don't care about, so remove it.

    replace j = subinstr(j, " th_aft", "", .)

The last element can be separated off and then removed.

    gen last = word(j, -1)
    replace j = reverse(j)
    replace j = subinstr(j, word(j,1) , "", 1)
    replace j = reverse(j)

We reverse it in order to avoid removing any identical substring. Those three lines could be telescoped into one.

Then it looks like an exercise in -subinstr()- and -split-.

Nick

On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <ebooth@ppri.tamu.edu> wrote:

<>

Here's an example... note that I messed with the formatting of the %'s and $'s in my example data a bit to show how flexible the -regex- is in the latter part of the code; however, you'll need to check that there aren't other patterns/symbols in your string that could break my code.

There are other ways to approach this, but I think the logic here is easy to follow:

    *************! watch for wrapping:
    **example data:
    clear
    inp str70(j)
    "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
    "A: 0.25%-$198(M) 0.12%-$398(M) 0.99%-$300(M) 0.00% th_aft."
    "A: 1.0%-$109(M) 0.1% th_aft."
    "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
    end

    **regexm example == easier to use -split- initially
    g example = regexs(0) ///
        if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
    l
    drop example

    **split:
    replace j = subinstr(j, "A: ", "", 1)
    split j, p("(M) ")

    **first, find x10 :
    g x10 = ""
    tempvar flag
    g `flag' = ""
    foreach var of varlist j? {
        replace `flag' = "`var'" if ///
            strpos(`var', "th_aft")>0
        replace x10 = subinstr(`var', "th_aft.", "", .) ///
            if `flag' == "`var'"
        replace `var' = "" if strpos(`var', "th_aft")>0
    }

    **now, create x1-x9 and y1-y9
    forval num = 1/9 {
        g x`num' = ""
        g y`num' = ""
        cap replace x`num' = regexs(0) if ///
            regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
            & !mi(j`num') & mi(x`num') //probably overkill
        cap replace y`num' = regexs(0) if ///
            regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
            & !mi(j`num') & mi(y`num')
    }

    **finally, create y10 == y2:
    g y10 = y2

    ****list:
    l *1
    l *2
    l *3
    *************!

- Eric

On Aug 26, 2011, at 6:59 PM, KOTa wrote:

I am trying to extract some data from a text variable and, being new to Stata programming, I am struggling to find the right format.

My problem is as follows. For example, I have a string variable like this:

    "A: 0.35%-$100(M) 0.30%-$300(M) 0.27% th_aft."

The number of pairs "% - (M)" can be from 1 to 9 and it always ends by "% th_aft".

I have 10 pairs of variables X1 Y1 .... X10 Y10.

My goal is to extract all pairs from the string variable and split them into my separate variables.

In this case the result should be:

    X1 = 0.35%  Y1 = $100
    X2 = 0.30%  Y2 = $300
    X3-X9 = Y3-Y9 = 0
    X10 = 0.27%  Y10 = Y2 (i.e. last Y extracted from string)

I am trying to use regexm but unsuccessfully. Any suggestions?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
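
Two points from the thread can be sketched in code: why the "$"-anchored pattern returns nothing, and Nick's suggestion to loop over the string to count matches. This is only a sketch under assumptions: the two test strings are made up in the format KOTa describes, and the names last_pct, work and n_pct are hypothetical. The "$" anchor fails on the first string because that string ends in " th_aft.", so no "%" sits at the very end. Asking instead for a percentage followed by no further "%" should pick up the last one whether or not the terminator is present.

```stata
* Hypothetical test data in the format described in the thread.
* \$ stops Stata from treating $ as the start of a global macro name.
clear
set obs 2
gen str80 fee_str = ""
replace fee_str = "0.15%-\$1(B) 0.14%-\$2(B) 0.12%-\$2(B) 0.10% th_aft." in 1
replace fee_str = "0.35%-\$100(M) 0.30%" in 2

* Last percentage: a number plus %, followed by no further % up to the end.
* regexs(1) returns only the parenthesised part, not the trailing text.
gen last_pct = regexs(1) if regexm(fee_str, "([0-9]+\.?[0-9]*%)[^%]*$")

* Count percentages by repeatedly deleting the first match from a working copy.
gen work = fee_str
gen n_pct = 0
local more 1
while `more' {
    * add one to the count wherever a match is still present
    replace n_pct = n_pct + 1 if regexm(work, "[0-9]+\.?[0-9]*%")
    * regexr() removes only the first match in each observation
    replace work = regexr(work, "[0-9]+\.?[0-9]*%", "")
    * stop once no observation has a match left
    count if regexm(work, "[0-9]+\.?[0-9]*%")
    local more = r(N) > 0
}
drop work

list fee_str last_pct n_pct
```

The counting loop does not depend on the numbered subexpressions that regexs() exposes, so the limit of nine does not come into it. -moss- from SSC, which Nick recommends, is the packaged alternative to writing such a loop by hand.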

**Follow-Ups**:
- Re: st: regexm, from Robert Picard <picard@netbox.com>

**References**:
- st: regexm, from KOTa <kota.alba@gmail.com>
- Re: st: regexm, from Eric Booth <ebooth@ppri.tamu.edu>
- Re: st: regexm, from Nick Cox <njcoxstata@gmail.com>
- Re: st: regexm, from KOTa <kota.alba@gmail.com>
- Re: st: regexm, from Nick Cox <njcoxstata@gmail.com>
- Re: st: regexm, from KOTa <kota.alba@gmail.com>
- Re: st: regexm, from Nick Cox <njcoxstata@gmail.com>
- Re: st: regexm, from KOTa <kota.alba@gmail.com>
