Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: KOTa <[email protected]>
To: [email protected]
Subject: Re: st: regexm
Date: Sat, 27 Aug 2011 17:14:51 +0200

thanks

2011/8/27 Robert Picard <[email protected]>:
> I second looking at -moss- from SSC. Try:
>
> moss svar, match("([0-9\.]+)") regex
>
> Robert
>
> On Sat, Aug 27, 2011 at 10:33 AM, Nick Cox <[email protected]> wrote:
>> Strings longer than 244 characters cannot be read into variables. You
>> could read them into Mata.
>>
>> As said, do look at -moss-.
>>
>> Nick
>>
>> On 27 Aug 2011, at 15:22, KOTa <[email protected]> wrote:
>>
>>> simpler in a logistical way, i.e. i tried to do the whole thing without
>>> creating additional variables (that split creates) in the middle.
>>>
>>> another question, if you know, also about strings. when i import a file
>>> to stata (from excel, for example) i have some very long strings that
>>> stata cuts to 244 chars.
>>>
>>> is there any trick to get around it? except making them shorter before
>>> importing :)
>>>
>>> thank you
>>>
>>> 2011/8/27 Nick Cox <[email protected]>:
>>>>
>>>> Better in what sense? Quicker to get a solution? Simpler? Other
>>>> criteria?
>>>>
>>>> I don't know a way of counting more than 9 matches directly. I think
>>>> you would need, if you continue to follow that path, to loop over a
>>>> string repeatedly finding new instances and counting.
>>>>
>>>> See also -moss- from SSC.
>>>>
>>>> Nick
>>>>
>>>> On Sat, Aug 27, 2011 at 2:52 PM, KOTa <[email protected]> wrote:
>>>>>
>>>>> yes, i do work now with split, just thought with regex it would be
>>>>> better.
>>>>>
>>>>> anyway, is there a way to find out how many expressions regexm finds?
>>>>> 1. what i mean is i can access the 1st, 2nd etc. up to the 9th with
>>>>>    regexs, but if i don't know how many there are -> i don't know
>>>>>    which one is last.
>>>>> 2. what if more than 9 expressions are found? according to the manual
>>>>>    regexs can only take parameters 0-9.
>>>>>
>>>>> thanks
>>>>>
>>>>> 2011/8/27 Nick Cox <[email protected]>:
>>>>>>
>>>>>> Well, you did say "it always ends by "% th_aft".
>>>>>>
>>>>>> I will continue as I started.
>>>>>>
>>>>>> If you first blank out stuff you don't need then you can just use
>>>>>> -split- to separate out elements. If you parse on spaces then it is
>>>>>> immaterial whether you have 2 or 3 digits before, you retrieve the
>>>>>> number either way.
>>>>>>
>>>>>> No need for regex demonstrated.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Sat, Aug 27, 2011 at 2:16 PM, KOTa <[email protected]> wrote:
>>>>>>>
>>>>>>> thanks Eric, Nick. I used your advice and almost finished,
>>>>>>>
>>>>>>> but encountered one small problem on the way.
>>>>>>>
>>>>>>> i have the same type of string:
>>>>>>>
>>>>>>> "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B) 0.10% th_aft."
>>>>>>>
>>>>>>> the number of digits after the dot can be 2 or 3, it's not constant
>>>>>>>
>>>>>>> and i am trying to extract the last % (i.e. 0.10% in this case)
>>>>>>> using "$" like this:
>>>>>>>
>>>>>>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")
>>>>>>>
>>>>>>> and it fails in both cases.
>>>>>>>
>>>>>>> the result is empty
>>>>>>>
>>>>>>> it does extract the first one (0.15%) if i don't use "$"
>>>>>>>
>>>>>>> what is wrong?
>>>>>>>
>>>>>>> thanks
>>>>>>>
>>>>>>> p.s. Nick, th_aft is not a terminator, it's not always there
>>>>>>>
>>>>>>> 2011/8/27 Nick Cox <[email protected]>:
>>>>>>>>
>>>>>>>> It is not obvious to me that you need -regexm()- at all.
>>>>>>>>
>>>>>>>> The text " th_aft" appears to be just a terminator that you don't
>>>>>>>> care about, so remove it.
>>>>>>>>
>>>>>>>> replace j = subinstr(j, " th_aft", "", .)
>>>>>>>>
>>>>>>>> The last element can be separated off and then removed.
>>>>>>>>
>>>>>>>> gen last = word(j, -1)
>>>>>>>>
>>>>>>>> replace j = reverse(j)
>>>>>>>> replace j = subinstr(j, word(j,1) , "", 1)
>>>>>>>> replace j = reverse(j)
>>>>>>>>
>>>>>>>> We reverse it in order to avoid removing any identical substring.
>>>>>>>>
>>>>>>>> Those three lines could be telescoped into one.
>>>>>>>>
>>>>>>>> Then it looks like an exercise in -subinstr()- and -split-.
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> <>
>>>>>>>>>
>>>>>>>>> Here's an example...note that I messed with the formatting of the
>>>>>>>>> %'s and $'s in my example data a bit to show how flexible the
>>>>>>>>> -regex- is in the latter part of the code; however, you'll need
>>>>>>>>> to check that there aren't other patterns/symbols in your string
>>>>>>>>> that could break my code.
>>>>>>>>> There are other ways to approach this, but I think the logic here
>>>>>>>>> is easy to follow:
>>>>>>>>>
>>>>>>>>> *************! watch for wrapping:
>>>>>>>>>
>>>>>>>>> **example data:
>>>>>>>>> clear
>>>>>>>>> inp str70(j)
>>>>>>>>> "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
>>>>>>>>> "A: 0.25%-$198(M) 0.12%-$398(M) 0.99%-$300(M) 0.00% th_aft."
>>>>>>>>> "A: 1.0%-$109(M) 0.1% th_aft."
>>>>>>>>> "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
>>>>>>>>> end
>>>>>>>>>
>>>>>>>>> **regexm example == easier to use -split- initially
>>>>>>>>> g example = regexs(0) ///
>>>>>>>>>     if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
>>>>>>>>> l
>>>>>>>>> drop example
>>>>>>>>>
>>>>>>>>> **split:
>>>>>>>>> replace j = subinstr(j, "A: ", "", 1)
>>>>>>>>> split j, p("(M) ")
>>>>>>>>>
>>>>>>>>> **first, find x10 :
>>>>>>>>> g x10 = ""
>>>>>>>>>
>>>>>>>>> tempvar flag
>>>>>>>>> g `flag' = ""
>>>>>>>>> foreach var of varlist j? {
>>>>>>>>>     replace `flag' = "`var'" if ///
>>>>>>>>>         strpos(`var', "th_aft")>0
>>>>>>>>>     replace x10 = subinstr(`var', "th_aft.", "", .) ///
>>>>>>>>>         if `flag' == "`var'"
>>>>>>>>>     replace `var' = "" if strpos(`var', "th_aft")>0
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> **now, create x1-x9 and y1-y9
>>>>>>>>> forval num = 1/9 {
>>>>>>>>>     g x`num' = ""
>>>>>>>>>     g y`num' = ""
>>>>>>>>>     cap replace x`num' = regexs(0) if ///
>>>>>>>>>         regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
>>>>>>>>>         & !mi(j`num') & mi(x`num') //probably overkill
>>>>>>>>>     cap replace y`num' = regexs(0) if ///
>>>>>>>>>         regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
>>>>>>>>>         & !mi(j`num') & mi(y`num')
>>>>>>>>> }
>>>>>>>>> **finally, create y10 == y2:
>>>>>>>>> g y10 = y2
>>>>>>>>>
>>>>>>>>> ****list:
>>>>>>>>> l *1
>>>>>>>>> l *2
>>>>>>>>> l *3
>>>>>>>>>
>>>>>>>>> *************!
>>>>>>>>> - Eric
>>>>>>>>>
>>>>>>>>> On Aug 26, 2011, at 6:59 PM, KOTa wrote:
>>>>>>>>>
>>>>>>>>>> I am trying to extract some data from a text variable and, being
>>>>>>>>>> new to stata programming, am struggling to find the right format.
>>>>>>>>>>
>>>>>>>>>> my problem is as follows:
>>>>>>>>>>
>>>>>>>>>> for example i have a string variable like this:
>>>>>>>>>> "A: 0.35%-$100(M) 0.30%-$300(M) 0.27% th_aft."
>>>>>>>>>>
>>>>>>>>>> the number of pairs "% - (M)" can be from 1 to 9 and it always
>>>>>>>>>> ends by "% th_aft"
>>>>>>>>>>
>>>>>>>>>> I have 10 pairs of variables X1 Y1 .... X10 Y10
>>>>>>>>>>
>>>>>>>>>> my goal is to extract all pairs from the string variable and
>>>>>>>>>> split them into my separate variables.
>>>>>>>>>>
>>>>>>>>>> in this case the result should be:
>>>>>>>>>>
>>>>>>>>>> X1 = 0.35%
>>>>>>>>>> Y1 = $100
>>>>>>>>>>
>>>>>>>>>> X2 = 0.30%
>>>>>>>>>> Y2 = $300
>>>>>>>>>>
>>>>>>>>>> X3-X9 = Y3-Y9 = 0
>>>>>>>>>>
>>>>>>>>>> X10 = 0.27%
>>>>>>>>>> Y10 = Y2 (i.e. last Y extracted from string)
>>>>>>>>>>
>>>>>>>>>> I am trying to use regexm but unsuccessfully. Any suggestions?
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
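
Editorial sketch, not part of the original thread: the "$"-anchored patterns quoted above most likely return nothing because the example string ends in " th_aft." rather than in "%", so "%$" has nothing to match at the end of the string. Below is a minimal illustration of two ways to pick up the last percentage whether or not the terminator is present, plus a simple count that does not depend on regexs(). Only fee_str, "th_aft." and the functions word(), subinstr(), regexm() and regexs() come from the thread; the toy data and the names clean, last_pct, last_pct_rx and n_pct are assumptions made up for this example. The count assumes every "%" in the string marks exactly one percentage entry.

```stata
* Sketch only: toy data; all variable names except fee_str are hypothetical.
clear
set obs 3
gen str80 fee_str = ""
* char(36) is the dollar sign; building it this way avoids any macro-expansion worries
replace fee_str = "0.15%-" + char(36) + "1(B) 0.14%-" + char(36) + "2(B) 0.10% th_aft." in 1
replace fee_str = "0.35%-" + char(36) + "197(M) 0.27%" in 2
replace fee_str = "1.0% th_aft." in 3

* (a) No regex: drop the optional terminator, then take the last word
gen clean = trim(subinstr(fee_str, "th_aft.", "", .))
gen last_pct = word(clean, -1)

* (b) Regex: a percentage followed by no further "%" up to the end of the string
gen last_pct_rx = regexs(1) if regexm(fee_str, "([0-9]+\.?[0-9]*%)[^%]*$")

* (c) Count entries: string length before and after removing every "%"
gen n_pct = length(fee_str) - length(subinstr(fee_str, "%", "", .))

list fee_str last_pct last_pct_rx n_pct
```

Approach (a) follows Nick's suggestion to strip what you don't need and then parse on spaces. Approach (b) uses only one subexpression, so the 0-9 limit on regexs() does not come into play.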

