Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: regexm

From	Eric Booth <[email protected]>
To	"<[email protected]>" <[email protected]>
Subject	Re: st: regexm
Date	Sat, 27 Aug 2011 20:32:59 +0000
<>

On Sat, Aug 27, 2011 at 2:16 PM, KOTa <[email protected]> wrote:
> i have the same type of string -  "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B)
> 0.10% th_aft." - number of digits after the dot can be 2 or 3, it's
> not constant
> and i am trying to extract the last % (i.e.0.10% in this case) using
> "$" like this:
>
> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$") or g
> example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$") and it
> fails in both cases.
> the result is empty
> it does extract the first one (0.15%) if i dont use "$"

"$" at the end of your regexm() pattern won't work because "0.10%" is not at the end of the string you are searching: the string ends with " th_aft.", and "$" only matches at the very end.  Also, keep in mind that in your second search, with the "$" removed:

g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+")

you say that Stata finds "0.15%" instead of "0.10%" -- this is because regexm() matches the first occurrence of the pattern in the string, and regexs(0) returns that match, here "0.15%" (hence the utility of -moss-).
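To make the two points above concrete, here is a minimal sketch using the string from the question. The variable names (anchored, first, trimmed, last) are made up for illustration, and removing "th_aft." with subinstr() is just one way to get the last percent to the end of the string; it assumes that text is the only thing that follows.

```stata
clear
set obs 1
* \$ guards against Stata reading $1 and $2 as global macros
g x = "0.15%-\$1(B) 0.14%-\$2(B) 0.12%-\$2(B) 0.10% th_aft."

* anchored to the end of x: no match, because x ends in " th_aft."
g anchored = regexs(0) if regexm(x, "[0-9]+\.[0-9]*[%]$")

* unanchored: regexs(0) returns the first match in x
g first = regexs(0) if regexm(x, "[0-9]+\.[0-9]*[%]")

* with the trailing text removed, the anchored pattern can match
g trimmed = trim(subinstr(x, "th_aft.", "", .))
g last = regexs(0) if regexm(trimmed, "[0-9]+\.[0-9]*[%]$")

l anchored first last
```

Here anchored stays empty, first holds "0.15%", and last holds "0.10%".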


You are asking for a regexm() that can differentiate "0.15%" from "0.10%" -- the pattern itself cannot tell these elements apart, so you need to look at what surrounds them if this is the approach you want to take.  The main difference is that "0.10%" has a space character both before and after it (unlike the percents earlier in the string), so (assuming this is true for all observations; if nothing follows the last percent, the trailing [ ] will not match) you could use:
**
clear
set obs 1
g x = "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B) 0.10% th_aft."
g example = regexs(0) if regexm( x, "[ ][0-9]+\.[0-9]*[%]+[ ]")
replace example = trim(example)
**
or just stick to the "**find x10" part of my example (which already finds this element) or use Nick's suggestions about -moss- or -split- to solve this.
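If the trailing text is not always there (as the p.s. in the question says), another possible workaround is to turn the first-match behaviour to your advantage: reverse the string, search it with a mirrored pattern, and reverse what comes back. This is only a sketch under the assumption that the last percent in the string is the one you want; the variable name lastpct and the second test string are invented for illustration.

```stata
clear
set obs 2
* \$ guards against Stata reading $1 and $2 as global macros
g str60 x = "0.15%-\$1(B) 0.14%-\$2(B) 0.12%-\$2(B) 0.10% th_aft." in 1
replace x = "0.15%-\$1(B) 0.125%" in 2

* regexm() finds the first match, so search the reversed string with
* the pattern mirrored: "%", then digits, then ".", then digits
g lastpct = reverse(regexs(0)) if regexm(reverse(x), "%[0-9]*\.[0-9]+")

l x lastpct
```

The mirrored pattern accepts the same text as "[0-9]+\.[0-9]*%" read backwards, so it does not depend on a space or on "th_aft." following the last percent.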


On Sat, Aug 27, 2011 at 2:52 PM, KOTa <[email protected]> wrote:
> anyway, is there a way to find out how many expressions regexm finds?
> 1. what i mean is i can access the 1st 2nd etc up to 9 with regexs,
> but if i dont know how many there are -> i dont know which one is
> last.
> 2. what if more than 9 expressions found? according to manual regexs
> only can have 0-9 parameters.

Where you asked about iterating the number in regexs(), I think you were asking whether you could grab the nth substring match in a string, but that's not what regexs(n) does.  Iterating the number in regexs() does not move to the next match for your expression in the entire string; instead, it gives you (up to 9) parenthesized pieces of the first matching substring.
For example:

**
clear
set obs 1
g str40 x = "0.15% 0.17% 0.99 1.1%"
g ex = regexs(0) if regexm( x, "([0-9]+\.[0-9]*[%]+)([ ])([0-9]+\.[0-9]*[%]+)")
g ex1 = regexs(1) if regexm( x, "([0-9]+\.[0-9]*[%]+)([ ])([0-9]+\.[0-9]*[%]+)")
g ex2 = regexs(2) if regexm( x, "([0-9]+\.[0-9]*[%]+)([ ])([0-9]+\.[0-9]*[%]+)")
g ex3 = regexs(3) if regexm( x, "([0-9]+\.[0-9]*[%]+)([ ])([0-9]+\.[0-9]*[%]+)")
**

Note that the parentheses mark subexpressions within the first match made by the pattern.  That is, regexm() matches "0.15% 0.17%" as the first match, and variable 'ex' holds the entire matching substring (regexs(0)).  regexs(1) returns the part of this match that corresponds to the first parenthesized subexpression, "0.15%".  'ex2' contains the space character, since that is what the second subexpression matched.  Finally, 'ex3' contains the last subexpression's match ("0.17%").
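If you do want to know how many separate matches there are and which one is last, one way to follow Nick's suggestion of looping over the string is sketched below. It repeatedly finds the first match, records it, and blanks it out with regexr() until nothing matches. The variable names (work, n, last, hit) are made up, and -moss- from SSC is likely the more convenient tool for this; the sketch is only meant to show the idea.

```stata
clear
set obs 1
g str40 x = "0.15% 0.17% 0.99 1.1%"

g work = x          // working copy that gets consumed
g n = 0             // number of matches found so far
g last = ""         // most recent match found
g byte hit = regexm(work, "[0-9]+\.[0-9]*%")
count if hit
local more = r(N)

while `more' > 0 {
    * record the first remaining match
    replace last = regexs(0) if regexm(work, "[0-9]+\.[0-9]*%")
    replace n = n + 1 if hit
    * replace that match with a space so the next pass finds the next one
    replace work = regexr(work, "[0-9]+\.[0-9]*%", " ") if hit
    replace hit = regexm(work, "[0-9]+\.[0-9]*%")
    count if hit
    local more = r(N)
}

l x n last
```

For this string the loop finds three matches ("0.99" has no percent sign, so it is skipped), and last ends up as "1.1%". Because the count is kept in a variable rather than in regexs(), the limit of 9 subexpressions does not apply.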


On Aug 27, 2011, at 9:22 AM, KOTa wrote:
> another question, if you know. also about strings. when i import file
> to stata (from excel, for example) i have some very long strings, that
> stata cuts to 244 chars.
> is there any trick to go around it? except making them shorter before
> importing :)

For your question about long strings: I've never figured out how to use Mata to read them in and eventually get them (in pieces) into my Stata dataset, but that's because I don't know much about Mata and haven't spent the time working out the Mata commands to make this happen.  Instead (and out of laziness) I've used -intext- (from SSC) to import the data into Stata: it splits each long string into multiple string variables of the length() you specify, and then you can proceed as needed.

__

I learned something new from both of Nick's posts about using -moss- or the word() operator with -split-.  I was doing it the (really) long way (I'd claim as my defense that I was trying to use a regex based solution since that was the question topic in the OP, but it's not true -- I just didn't think of/know about/explore the shorter approaches).  I imagine it's probably worth reading Nick's Stata Tip 98 in SJ to help supplement the discussion in this thread (just based on the title -- I haven't read it yet).

- Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
+979.845.6754

On Aug 27, 2011, at 9:22 AM, KOTa wrote:

> simpler in a logistics way. i.e. i tried to do the whole thing without
> creating additional variables (that split creates) in the middle.
>
> another question, if you know. also about strings. when i import file
> to stata (from excel, for example) i have some very long strings, that
> stata cuts to 244 chars.
>
> is there any trick to go around it? except making them shorter before
> importing :)
>
> thank you
>
> 2011/8/27 Nick Cox <[email protected]>:
>> Better in what sense? Quicker to get a solution? Simpler? Other criteria?
>>
>> I don't know a way of counting more than 9 matches directly. I think
>> you would need, if you continue to follow that path, to loop over a
>> string repeatedly finding new instances and counting.
>>
>> See also -moss- from SSC.
>>
>> Nick
>>
>> On Sat, Aug 27, 2011 at 2:52 PM, KOTa <[email protected]> wrote:
>>> yes, i do work now with split, just thought with regex it will be better.
>>>
>>> anyway, is there a way to find out how many expressions regexm finds?
>>> 1. what i mean is i can access the 1st 2nd etc up to 9 with regexs,
>>> but if i dont know how many there are -> i dont know which one is
>>> last.
>>> 2. what if more than 9 expressions found? according to manual regexs
>>> only can have 0-9 parameters.
>>>
>>>
>>> thanks
>>>
>>> 2011/8/27 Nick Cox <[email protected]>:
>>>> Well, you did say "it always ends by "% th_aft".
>>>>
>>>> I will continue as I started.
>>>>
>>>> If you first blank out stuff you don't need then you can just use
>>>> -split- to separate out elements. If you parse on spaces then it is
>>>> immaterial when you have 2 or 3 digits before, you retrieve the number
>>>> either way.
>>>>
>>>> No need for regex demonstrated.
>>>>
>>>> Nick
>>>>
>>>> On Sat, Aug 27, 2011 at 2:16 PM, KOTa <[email protected]> wrote:
>>>>> thanks Eric, Nick I used your advice and almost finished.
>>>>>
>>>>> but encountered one small problems on the way.
>>>>>
>>>>> i have the same type of string -  "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B)
>>>>> 0.10% th_aft." - number of digits after the dot can be 2 or 3, it's
>>>>> not constant
>>>>>
>>>>> and i am trying to extract the last % (i.e.0.10% in this case) using
>>>>> "$" like this:
>>>>>
>>>>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$") or g
>>>>> example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$") and it
>>>>> fails in both cases.
>>>>>
>>>>> the result is empty
>>>>>
>>>>> it does extract the first one (0.15%) if i dont use "$"
>>>>>
>>>>> what is wrong?
>>>>>
>>>>> thanks
>>>>>
>>>>> p.s. Nick, th_aft is not a terminator, its not always there
>>>>>
>>>>>
>>>>> 2011/8/27 Nick Cox <[email protected]>:
>>>>>> It is not obvious to me that you need -regexm()- at all.
>>>>>>
>>>>>> The text " th_aft" appears to be just a terminator that you don't care
>>>>>> about, so remove it.
>>>>>>
>>>>>> replace j = subinstr(j, " th_aft", "", .)
>>>>>>
>>>>>> The last element can be separated off and then removed.
>>>>>>
>>>>>> gen last = word(j, -1)
>>>>>>
>>>>>> replace j = reverse(j)
>>>>>> replace j = subinstr(j, word(j,1) , "", 1)
>>>>>> replace j = reverse(j)
>>>>>>
>>>>>> We reverse it in order to avoid removing any identical substring.
>>>>>>
>>>>>> Those three lines could be telescoped into one.
>>>>>>
>>>>>> Then it looks like an exercise in -subinstr()- and -split-.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <[email protected]> wrote:
>>>>>>> <>
>>>>>>>
>>>>>>> Here's an example...note that I messed with the formatting of the %'s and $'s in my example data a bit to show how flexible the -regex- is in the latter part of the code; however, you'll need to check that there aren't other patterns/symbols in your string that could break my code.
>>>>>>> There are other ways to approach this, but I think the logic here is easy to follow:
>>>>>>>
>>>>>>> *************! watch for wrapping:
>>>>>>>
>>>>>>> **example data:
>>>>>>> clear
>>>>>>> inp str70(j)
>>>>>>> "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
>>>>>>> "A: 0.25%-$198(M) 0.12%-$398(M)  0.99%-$300(M) 0.00% th_aft."
>>>>>>> "A: 1.0%-$109(M) 0.1% th_aft."
>>>>>>> "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
>>>>>>> end
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> **regexm example == easier to use -split- initially
>>>>>>> g example = regexs(0) ///
>>>>>>> if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
>>>>>>> l
>>>>>>> drop example
>>>>>>>
>>>>>>>
>>>>>>> **split:
>>>>>>> replace j = subinstr(j, "A: ", "", 1)
>>>>>>> split j, p("(M) ")
>>>>>>>
>>>>>>> **first, find x10 :
>>>>>>> g x10 = ""
>>>>>>>
>>>>>>> tempvar flag
>>>>>>> g `flag' = ""
>>>>>>> foreach var of varlist j? {
>>>>>>> replace `flag' = "`var'" if ///
>>>>>>>       strpos(`var', "th_aft")>0
>>>>>>> replace x10  = subinstr(`var', "th_aft.", "", .) ///
>>>>>>>        if `flag' == "`var'"
>>>>>>> replace `var' = "" if strpos(`var', "th_aft")>0
>>>>>>>       }
>>>>>>>
>>>>>>>
>>>>>>> **now, create x1-x9 and y1-y9
>>>>>>> forval num = 1/9 {
>>>>>>> g x`num' = ""
>>>>>>> g y`num' = ""
>>>>>>> cap replace x`num' = regexs(0) if ///
>>>>>>>       regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
>>>>>>>       & !mi(j`num') & mi(x`num') //probably overkill
>>>>>>> cap replace y`num' = regexs(0) if ///
>>>>>>>       regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
>>>>>>>       & !mi(j`num') & mi(y`num')
>>>>>>>       }
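>>>>>>> **finally, create y10 == last nonmissing y (y2 in the original example):
>>>>>>> g y10 = ""
>>>>>>> forval num = 1/9 {
>>>>>>> replace y10 = y`num' if !mi(y`num')
>>>>>>>       }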
>>>>>>>
>>>>>>>
>>>>>>> ****list:
>>>>>>> l *1
>>>>>>> l *2
>>>>>>> l *3
>>>>>>>
>>>>>>> *************!
>>>>>>> - Eric
>>>>>>>
>>>>>>> On Aug 26, 2011, at 6:59 PM, KOTa wrote:
>>>>>>
>>>>>>>> I am trying to extract some data from text variable and being new to
>>>>>>>> stata programming struggling with finding right format.
>>>>>>>>
>>>>>>>> my problem is as following:
>>>>>>>>
>>>>>>>> for example i have string variable as following: "A: 0.35%-$100(M)
>>>>>>>> 0.30%-$300(M) 0.27% th_aft."
>>>>>>>>
>>>>>>>> number of pairs "% - (M)" can be from 1 to 9 and it always ends by "% th_aft"
>>>>>>>>
>>>>>>>> I have 10 pairs of variables X1 Y1 .... X10 Y10
>>>>>>>>
>>>>>>>> my goal is to extract all pairs from the string variable and split
>>>>>>>> them into my separate variables.
>>>>>>>>
>>>>>>>> in this case the result should be:
>>>>>>>>
>>>>>>>> X1  = 0.35%
>>>>>>>> Y1 = $100
>>>>>>>>
>>>>>>>> X2 = 0.30%
>>>>>>>> Y2 = $300
>>>>>>>>
>>>>>>>> X3-X9 = y3-Y9 = 0
>>>>>>>>
>>>>>>>> X10 = 0.27%
>>>>>>>> Y10 = Y2 (i.e. last Y extracted from string)
>>>>>>>>
>>>>>>>> I am trying to use regexm but unsuccessfully, Any suggestions?
>>>>>>>>
>>>>>>
>>>>>> *
>>>>>> *   For searches and help try:
>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>> *   http://www.stata.com/support/statalist/faq
>>>>>> *   http://www.ats.ucla.edu/stat/stata/