Re: st: regexm


From   KOTa <kota.alba@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: regexm
Date   Sat, 27 Aug 2011 16:22:47 +0200

simpler in a logistical sense, i.e. i tried to do the whole thing without
creating the additional variables (that split creates) in the middle.

another question, if you know, also about strings: when i import a file
into stata (from excel, for example) i have some very long strings that
stata cuts to 244 chars.

is there any trick to get around it, other than making them shorter before
importing? :)

thank you

2011/8/27 Nick Cox <njcoxstata@gmail.com>:
> Better in what sense? Quicker to get a solution? Simpler? Other criteria?
>
> I don't know a way of counting more than 9 matches directly. I think
> you would need, if you continue to follow that path, to loop over a
> string repeatedly finding new instances and counting.
>
> See also -moss- from SSC.
>
> Nick
>
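A minimal sketch of the looping idea described above, not taken from anyone's posted code: it repeatedly finds the first remaining percentage with -regexm()-, records it, and deletes it with -regexr()- until nothing is left. The variable names (fee_str, work, n_pct, last_pct) and the pattern are assumptions for illustration, and the sample strings leave out the dollar amounts to keep the example short.

```stata
* Sketch: count percentage tokens and keep the last one found.
clear
input str60 fee_str
"0.15% 0.14% 0.12% 0.10% th_aft."
"1.0% 0.1% th_aft."
end

* assumed pattern for a single percentage such as 0.15% or 1.0%
local re "[0-9]+\.?[0-9]*%"

* work is a copy that gets consumed; n_pct counts matches so far;
* last_pct holds the most recent match, which ends up being the last one
gen work     = fee_str
gen n_pct    = 0
gen last_pct = ""

local more 1
while `more' {
    * flag observations that still contain a percentage
    gen byte hit = regexm(work, "`re'")
    count if hit
    local more = r(N) > 0
    replace last_pct = regexs(0) if regexm(work, "`re'")
    replace n_pct    = n_pct + 1 if hit
    * delete the first remaining match so the next pass sees the next one
    replace work = regexr(work, "`re'", "") if hit
    drop hit
}
list fee_str n_pct last_pct
```

Because the count lives in a variable rather than in the regexs() subexpression slots, it is not tied to the 0-9 limit; the cost is one pass over the data per match.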
> On Sat, Aug 27, 2011 at 2:52 PM, KOTa <kota.alba@gmail.com> wrote:
>> yes, i do work now with split, just thought it would be better with regex.
>>
>> anyway, is there a way to find out how many expressions regexm finds?
>> 1. what i mean is i can access the 1st, 2nd etc. up to the 9th with regexs,
>> but if i don't know how many there are, i don't know which one is
>> last.
>> 2. what if more than 9 expressions are found? according to the manual, regexs
>> can only take parameters 0-9.
>>
>>
>> thanks
>>
>> 2011/8/27 Nick Cox <njcoxstata@gmail.com>:
>>> Well, you did say "it always ends by "% th_aft".
>>>
>>> I will continue as I started.
>>>
>>> If you first blank out stuff you don't need then you can just use
>>> -split- to separate out elements. If you parse on spaces then it is
>>> immaterial when you have 2 or 3 digits before, you retrieve the number
>>> either way.
>>>
>>> No need for regex demonstrated.
>>>
>>> Nick
>>>
>>> On Sat, Aug 27, 2011 at 2:16 PM, KOTa <kota.alba@gmail.com> wrote:
>>>> thanks Eric, Nick. I used your advice and have almost finished.
>>>>
>>>> but i encountered one small problem on the way.
>>>>
>>>> i have the same type of string -  "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B)
>>>> 0.10% th_aft." - number of digits after the dot can be 2 or 3, it's
>>>> not constant
>>>>
>>>> and i am trying to extract the last % (i.e. 0.10% in this case) using
>>>> "$" like this:
>>>>
>>>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")
>>>>
>>>> or
>>>>
>>>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")
>>>>
>>>> and it fails in both cases.
>>>>
>>>> the result is empty
>>>>
>>>> it does extract the first one (0.15%) if i don't use "$"
>>>>
>>>> what is wrong?
>>>>
>>>> thanks
>>>>
>>>> p.s. Nick, th_aft is not a terminator, it's not always there
>>>>
>>>>
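The anchored patterns above come back empty because "$" ties the match to the very end of the string, and the example string ends in " th_aft." rather than in "%". One possible workaround, sketched here under assumptions, is to capture the percentage in a subexpression and let the pattern run on to the end of the string over any text that contains no further "%". The variable names anchored and last_pct are made up for illustration, "\.?" is used so that values like "0%" would also match, and the sample strings leave out the dollar signs to keep the example short.

```stata
* Sketch: pick out the last percentage whether or not th_aft follows it.
clear
input str60 fee_str
"0.15%-1(B) 0.14%-2(B) 0.12%-2(B) 0.10% th_aft."
"0.15%-1(B) 0.12%"
end

* anchored directly at the end: only matches when the string ends in %
gen anchored = regexs(0) if regexm(fee_str, "[0-9]+\.?[0-9]*%$")

* capture the percentage, then allow trailing text with no further % sign
gen last_pct = regexs(1) if regexm(fee_str, "([0-9]+\.?[0-9]*%)[^%]*$")

list
```

Here regexs(1) returns only the captured percentage, while regexs(0) would also include the trailing text.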
>>>> 2011/8/27 Nick Cox <njcoxstata@gmail.com>:
>>>>> It is not obvious to me that you need -regexm()- at all.
>>>>>
>>>>> The text " th_aft" appears to be just a terminator that you don't care
>>>>> about, so remove it.
>>>>>
>>>>> replace j = subinstr(j, " th_aft", "", .)
>>>>>
>>>>> The last element can be separated off and then removed.
>>>>>
>>>>> gen last = word(j, -1)
>>>>>
>>>>> replace j = reverse(j)
>>>>> replace j = subinstr(j, word(j,1) , "", 1)
>>>>> replace j = reverse(j)
>>>>>
>>>>> We reverse it in order to avoid removing any identical substring.
>>>>>
>>>>> Those three lines could be telescoped into one.
>>>>>
>>>>> Then it looks like an exercise in -subinstr()- and -split-.
>>>>>
>>>>> Nick
>>>>>
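One way the three reverse/subinstr/reverse lines above might be telescoped into a single expression is sketched below. This is an illustration rather than the code Nick had in mind. The sample string leaves out the dollar signs, and the terminator is removed together with its trailing period (" th_aft."), which is an assumption about the data.

```stata
* Sketch: separate off the last element, then remove it in one line.
clear
input str70 j
"0.35%-197(M) 0.30%-397(M) 0.27% th_aft."
end

* assumption: strip the terminator including its trailing period
replace j = subinstr(j, " th_aft.", "", .)

* keep the last element before it is removed
gen last = word(j, -1)

* reverse, drop the first word (the original last word), reverse back
replace j = reverse(subinstr(reverse(j), word(reverse(j), 1), "", 1))

list
```

A trailing blank is left at the end of j after the removal; trim(j) would tidy it if that matters for the later -split-.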
>>>>> On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <ebooth@ppri.tamu.edu> wrote:
>>>>>> <>
>>>>>>
>>>>>> Here's an example...note that I messed with the formatting of the %'s and $'s in my example data a bit to show how flexible the -regex- is in the latter part of the code; however, you'll need to check that there aren't other patterns/symbols in your string that could break my code.
>>>>>>  There are other ways to approach this, but I think the logic here is easy to follow:
>>>>>>
>>>>>> *************! watch for wrapping:
>>>>>>
>>>>>> **example data:
>>>>>> clear
>>>>>> inp str70(j)
>>>>>> "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
>>>>>> "A: 0.25%-$198(M) 0.12%-$398(M)  0.99%-$300(M) 0.00% th_aft."
>>>>>> "A: 1.0%-$109(M) 0.1% th_aft."
>>>>>> "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
>>>>>> end
>>>>>>
>>>>>>
>>>>>>
>>>>>> **regexm example == easier to use -split- initially
>>>>>> g example = regexs(0) ///
>>>>>>  if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
>>>>>> l
>>>>>> drop example
>>>>>>
>>>>>>
>>>>>> **split:
>>>>>> replace j = subinstr(j, "A: ", "", 1)
>>>>>> split j, p("(M) ")
>>>>>>
>>>>>> **first, find x10 :
>>>>>> g x10 = ""
>>>>>>
>>>>>> tempvar flag
>>>>>> g `flag' = ""
>>>>>> foreach var of varlist j? {
>>>>>> replace `flag' = "`var'" if ///
>>>>>>        strpos(`var', "th_aft")>0
>>>>>> replace x10  = subinstr(`var', "th_aft.", "", .) ///
>>>>>>         if `flag' == "`var'"
>>>>>> replace `var' = "" if strpos(`var', "th_aft")>0
>>>>>>        }
>>>>>>
>>>>>>
>>>>>> **now, create x1-x9 and y1-y9
>>>>>> forval num = 1/9 {
>>>>>>  g x`num' = ""
>>>>>>  g y`num' = ""
>>>>>>  cap replace x`num' = regexs(0) if ///
>>>>>>        regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
>>>>>>        & !mi(j`num') & mi(x`num') //probably overkill
>>>>>>  cap replace y`num' = regexs(0) if ///
>>>>>>        regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
>>>>>>        & !mi(j`num') & mi(y`num')
>>>>>>        }
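>>>>>> **finally, create y10 == last non-missing y (y2 in the first row):
>>>>>>  g y10 = ""
>>>>>>  forval num = 1/9 {
>>>>>>   replace y10 = y`num' if !mi(y`num')
>>>>>>  }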
>>>>>>
>>>>>>
>>>>>> ****list:
>>>>>> l *1
>>>>>> l *2
>>>>>> l *3
>>>>>>
>>>>>> *************!
>>>>>> - Eric
>>>>>>
>>>>>> On Aug 26, 2011, at 6:59 PM, KOTa wrote:
>>>>>
>>>>>>> I am trying to extract some data from a text variable and, being new to
>>>>>>> Stata programming, am struggling to find the right format.
>>>>>>>
>>>>>>> My problem is as follows:
>>>>>>>
>>>>>>> for example, I have a string variable like this: "A: 0.35%-$100(M)
>>>>>>> 0.30%-$300(M) 0.27% th_aft."
>>>>>>>
>>>>>>> the number of pairs "% - (M)" can be from 1 to 9 and it always ends with "% th_aft"
>>>>>>>
>>>>>>> I have 10 pairs of variables X1 Y1 .... X10 Y10
>>>>>>>
>>>>>>> my goal is to extract all pairs from the string variable and split
>>>>>>> them into my separate variables.
>>>>>>>
>>>>>>> in this case the result should be:
>>>>>>>
>>>>>>> X1  = 0.35%
>>>>>>> Y1 = $100
>>>>>>>
>>>>>>> X2 = 0.30%
>>>>>>> Y2 = $300
>>>>>>>
>>>>>>> X3-X9 = Y3-Y9 = 0
>>>>>>>
>>>>>>> X10 = 0.27%
>>>>>>> Y10 = Y2 (i.e. last Y extracted from string)
>>>>>>>
>>>>>>> I am trying to use regexm but unsuccessfully, Any suggestions?
>>>>>>>
>>>>>
>>>>> *
>>>>> *   For searches and help try:
>>>>> *   http://www.stata.com/help.cgi?search
>>>>> *   http://www.stata.com/support/statalist/faq
>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>

