Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: regexm

From	Eric Booth <[email protected]>
To	"<[email protected]>" <[email protected]>
Subject	Re: st: regexm
Date	Sat, 27 Aug 2011 01:28:42 +0000

Here's an example. Note that I varied the formatting of the %'s and $'s in my example data a bit to show how flexible -regexm- is in the latter part of the code; however, you'll need to check that there aren't other patterns or symbols in your strings that could break my code.
  There are other ways to approach this, but I think the logic here is easy to follow:

*************! watch for wrapping:

**example data:
clear
inp str70(j)
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
"A: 0.25%-$198(M) 0.12%-$398(M)  0.99%-$300(M) 0.00% th_aft."
"A: 1.0%-$109(M) 0.1% th_aft."
"A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
end



**regexm example (it only grabs the first matching pair, so it is easier to use -split- initially):
g example = regexs(0) ///
  if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
l
drop example


**split:
replace j = subinstr(j, "A: ", "", 1)
split j, p("(M) ")

**first, find x10 (the rate in the piece that contains "th_aft"):
**note: j?* rather than j? so that j10 is also picked up if there are 9 pairs
g x10 = ""

foreach var of varlist j?* {
	replace x10 = trim(subinstr(`var', "th_aft.", "", .)) ///
		if strpos(`var', "th_aft")>0
	replace `var' = "" if strpos(`var', "th_aft")>0
	}


**now, create x1-x9 and y1-y9
forval num = 1/9 {
 g x`num' = ""
 g y`num' = ""
  cap replace x`num' = regexs(0) if ///
	regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
	& !mi(j`num') & mi(x`num') //probably overkill
  cap replace y`num' = regexs(0) if ///
	regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
	& !mi(j`num') & mi(y`num')
	}
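**finally, create y10 == last nonmissing y (y2 only in the first example row):
 g y10 = ""
forval num = 1/9 {
 replace y10 = y`num' if !mi(y`num')
	}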


****list:
l *1
l *2
l *3

*************!
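
For comparison, here is a rough sketch of one of the "other ways": skip -split- and peel one pair at a time off the front of a working copy of the string with -regexm-. The names work, px1-px10 and py1-py10 are made up for this illustration, the example rows are rebuilt with -generate- so the block runs on its own, and it assumes every pair looks like number%-$digits(M). Treat it as a starting point rather than a tested solution for your real data.

```stata
* Sketch only: work, px*, py* are hypothetical names for this illustration.
clear
set obs 3
* \$ keeps Stata from treating $ as the start of a global macro
gen str70 j = "A: 0.35%-\$197(M) 0.30%-\$397(M) 0.27% th_aft." in 1
replace j = "A: 0.25%-\$198(M) 0.12%-\$398(M)  0.99%-\$300(M) 0.00% th_aft." in 2
replace j = "A: 1.0%-\$109(M) 0.1% th_aft." in 3

* working copy that shrinks by one pair on each pass
gen str70 work = j

forvalues k = 1/9 {
    gen px`k' = ""
    gen py`k' = ""
    * subexpression 1 = rate with %, subexpression 2 = $ amount
    replace px`k' = regexs(1) ///
        if regexm(work, "([0-9]+\.?[0-9]*%)-([\$][0-9]+)\(M\)")
    replace py`k' = regexs(2) ///
        if regexm(work, "([0-9]+\.?[0-9]*%)-([\$][0-9]+)\(M\)")
    * drop the pair just extracted so the next pass sees the next one
    replace work = subinstr(work, regexs(0), "", 1) ///
        if regexm(work, "([0-9]+\.?[0-9]*%)-([\$][0-9]+)\(M\)")
}

* the closing rate is the percentage sitting just before th_aft
gen px10 = ""
replace px10 = regexs(1) if regexm(j, "([0-9]+\.?[0-9]*%) *th_aft")

* py10 = last nonmissing amount (later pairs overwrite earlier ones)
gen py10 = ""
forvalues k = 1/9 {
    replace py10 = py`k' if py`k' != ""
}

list px1 py1 px2 py2 px3 py3 px10 py10
```

Because the pattern is unanchored, each pass finds the leftmost remaining pair, so the pairs come out in their original order without relying on a fixed separator.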
- Eric

On Aug 26, 2011, at 6:59 PM, KOTa wrote:

> Dear statalisters,
> 
> I am trying to extract some data from a text variable and, being new to
> Stata programming, am struggling to find the right format.
> 
> My problem is as follows:
> 
> For example, I have a string variable as follows: "A: 0.35%-$100(M)
> 0.30%-$300(M) 0.27% th_aft."
> 
> The number of pairs "% - (M)" can be from 1 to 9, and the string always ends with "% th_aft"
> 
> I have 10 pairs of variables X1 Y1 .... X10 Y10
> 
> my goal is to extract all pairs from the string variable and split
> them into my separate variables.
> 
> in this case the result should be:
> 
> X1  = 0.35%
> Y1 = $100
> 
> X2 = 0.30%
> Y2 = $300
> 
> X3-X9 = Y3-Y9 = 0
> 
> X10 = 0.27%
> Y10 = Y2 (i.e. last Y extracted from string)
> 
> I am trying to use regexm, but unsuccessfully. Any suggestions?
> 
> 
> thank you in advance
> 
> C.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/


Follow-Ups:
- Re: st: regexm
  - From: Nick Cox <[email protected]>

References:
- st: regexm
  - From: KOTa <[email protected]>
