
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



st: RE: RE: RE: puzzling string conversion


From   Nick Cox <n.j.cox@durham.ac.uk>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   st: RE: RE: RE: puzzling string conversion
Date   Thu, 10 Feb 2011 16:14:32 +0000

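But there is no need to proceed character by character. Inside the loop,

	replace id = regexr(id,"[^0-9]+","")

removes a whole run of non-numeric characters at each pass and should speed things up a bit. (With "*" in place of "+" the pattern can match an empty string at the start of a value that begins with a digit, so nothing would be removed.)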
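
For illustration, here is a minimal sketch of that idea on a cut-down version of the example data. The pattern "[^0-9]+" and the -regexm()- stopping rule are assumptions of this sketch, chosen so that each match is a non-empty run of non-numeric characters; -regexr()- still replaces only the first match in each observation, so the loop remains.

	clear
	input str30 mystring
	"111.aaa.22.2/33-33"
	"A"
	end

	gen id = mystring
	* count observations that still contain a non-numeric character
	count if regexm(id, "[^0-9]")

	qui while r(N) {
		* strip the first run of non-numeric characters in each observation
		replace id = regexr(id, "[^0-9]+", "")
		count if regexm(id, "[^0-9]")
	}

	list mystring id

The first observation needs four passes here (".aaa.", ".", "/", "-") rather than one pass per non-numeric character.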

Nick 
n.j.cox@durham.ac.uk 

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
Sent: 10 February 2011 15:56
To: 'statalist@hsphsun2.harvard.edu'
Subject: st: RE: RE: puzzling string conversion

Code closer to Dimitri's original is 

gen id = mystring
count if missing(real(id)) & (id != "") 

qui while r(N) {
	replace id = regexr(id,"[^0-9]","")
	count if missing(real(id)) & (id != "") 
}

destring id, gen(numid)
format numid %30.0f

Here r(N) is emitted by -count- and is non-zero (positive) while there's work still to do. 
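
A short sketch of what that condition tests, in case it helps: -real()- returns missing when a string cannot be read as a number, so the -count- picks up observations that are non-empty and not yet numeric. The strings below are made up for illustration.

	* a clean digit string converts, so it would not be counted
	display missing(real("1112223333"))
	* a letter makes the result missing, so it would be counted
	display missing(real("111aaa22"))
	* a lone decimal point still gives a valid number, so this would not be counted
	display missing(real("123.5"))

So a leftover such as "123.5" would pass the test with the "." still in place. If values like that can occur in your data, testing with regexm(id, "[^0-9]") instead is one way to guard against it.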

Nick 
n.j.cox@durham.ac.uk 

Nick Cox

Your -while- condition will be interpreted as referring to -id[1]- regardless. 
It does not itself loop over the data. The -replace- statement would be sufficient in itself if the regexp is what you want. 
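
To see this with the second example, here is a minimal sketch (two observations only, for brevity). Outside a command that works over observations, -id- on its own is evaluated as -id[1]-.

	clear
	input str30 mystring
	"A"
	"011.xyz.22.2/33-33"
	end

	gen id = mystring
	* the loop condition sees only the first observation
	display regexm(id, "[^0-9]")
	* one pass of -replace- empties id[1], so the condition is then false
	replace id = regexr(id, "[^0-9]", "")
	display regexm(id, "[^0-9]")
	* ... while the second observation still has non-numeric characters
	display regexm(id[2], "[^0-9]")

The three -display- lines show 1, 0 and 1: the -while- loop would stop after a single pass even though id[2] is far from clean.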

There are various solutions to extracting numeric characters only from a string. Here is another, more pedestrian in style. 

gen id = "" 
gen char = "" 
local length = substr("`: type mystring'",4,.) 

qui forval i = 1/`length' { 
	replace char = substr(mystring, `i', 1) 
	replace id = id + char if inrange(real(char), 0, 9)
} 
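
If you would rather not parse the storage type, one possible variation (a sketch, not the code above) takes the loop limit from the longest string actually present. The variable name len is made up here, and mystring is assumed to exist as in the original question.

	* length of the longest value of mystring
	gen len = length(mystring)
	summarize len, meanonly
	local length = r(max)
	drop len

	gen id = ""
	gen char = ""
	qui forval i = 1/`length' {
		* pick out the i-th character and keep it only if it is a digit
		replace char = substr(mystring, `i', 1)
		replace id = id + char if inrange(real(char), 0, 9)
	}
	drop char

This avoids looping over trailing positions that no observation uses.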

Dimitri Szerman

I got this puzzling result. I have a string variable, mystring, which
has both numeric and non-numeric characters. I'd like to extract only
the numeric ones and form a numeric variable from them (in fact, it's
going to be an id). I'm using regular expressions, and this is what
I'm doing:

input str30 mystring
"111.aaa.22.2/33-33"
"011.xyz.22.2/33-33"
"101.abc.22.2/33-33"
"222.foo.22.2/33-33"
"111.bla.22.2/33-33"
end

gen id = mystring
while regexm(id, "[^0-9]" ) {
 replace id = regexr(id,"[^0-9]","")
}
destring id, gen(numid)

And it works fine. However, if mystring has an observation which
contains very few (when compared to the other observations)
non-numeric characters, this seems to break down:

clear
input str30 mystring
"A"
"011.xyz.22.2/33-33"
"101.abc.22.2/33-33"
"222.foo.22.2/33-33"
"111.bla.22.2/33-33"
end

gen id = mystring
while regexm(id, "[^0-9]" ) {
 replace id = regexr(id,"[^0-9]","")
}
destring id, gen(numid)

Am I missing something? Why doesn't this work? Any suggestions?
