Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: RE: Remove prefixes (e.g., >, <, and +/-) from numbers stored as strings


From   Richard Herron <[email protected]>
To   [email protected]
Subject   Re: st: RE: Remove prefixes (e.g., >, <, and +/-) from numbers stored as strings
Date   Fri, 8 Jun 2012 17:13:48 -0400

Thanks, all! Good tips all around. I should go with the one-by-one
substitution using -subinstr()- to make sure that I know what I'm
doing.

@Steve -- Thanks for the functioning regex. Something like this works
and strips pre/postfixes.

* code
generate number2 = regexs(1) if regexm(combo, ///
    "^[^0-9]*([0-9]*\.?[0-9]*)[^0-9]*$")
*

In this case there _shouldn't_ be negative values (only +/- to indicate
approximate), but I should replace these characters one-by-one to be
sure of what I'm doing.
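
A minimal sketch of that one-by-one approach, assuming the only
decorations are a leading ">", "<", or "+/-" and a trailing "%". The
sample data and the names -clean- and -number3- are hypothetical and
used only for illustration.

```stata
* begin code
* Hypothetical sample data; clean and number3 are illustrative names
clear
input str12 combo
">88.27"
"91.53%"
"+/-30"
"76"
end

* Strip each known prefix/postfix explicitly, one substitution at a time
generate clean = trim(combo)
replace clean = subinstr(clean, "+/-", "", 1) if substr(clean, 1, 3) == "+/-"
replace clean = subinstr(clean, ">", "", 1) if substr(clean, 1, 1) == ">"
replace clean = subinstr(clean, "<", "", 1) if substr(clean, 1, 1) == "<"
replace clean = subinstr(clean, "%", "", 1) if substr(clean, -1, 1) == "%"

* real() returns missing for anything that is still not a plain number,
* so unanticipated characters show up in the listing below
generate number3 = real(clean)
list combo clean number3 if missing(number3)
* end code
```

Listing the observations where -real()- returns missing shows which
strings still carry characters that were not anticipated, so the filter
can be extended deliberately rather than by a catch-all pattern.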

Richard Herron


On Fri, Jun 8, 2012 at 3:04 PM, Steve Samuels <[email protected]> wrote:
> A regular expression solution that allows for characters other than
> "> and %" at start and finish.
>
> Steve
> [email protected]
>
> ****************
> clear
> input str20 combo
> ">88.27821"
> "91.53401%"
> "       76m "
> " -31.20785"
> ">-52.18793"
> "39.94933%"
> "      +61"
> " 89.47855"
> " +75.43917"
> ">82.67717"
> "46.31095%"
> "       81"
> " 45.24185"
> " 28.62701"
> ">77.13605"
> "46.79793%"
> "       62"
> " 19.50868"
> " 91.54968"
> " 86.64407"
> end
> replace combo = trim(combo)
> des
> * [0-9]* after the optional decimal point also admits single-digit values
> gen new1 = regexs(2)  ///
> if regexm(combo,"^([^0-9+-]?)((\+|\-)?[0-9]+\.?[0-9]*)([^0-9]?)$")
> destring new1,replace
> list
> ********************************************************************
>
> On Jun 8, 2012, at 2:40 PM, Nick Cox wrote:
>
> Cox's Third Law of string processing is "regex machinery is great, but
> always check first if something simpler will work directly".
>
> I really wouldn't want to support removing + and - characters
> separately. You could be removing genuine information!
>
> If the issue is solely the composite prefix, then
>
> subinstr(myvariable, "+/-", "", 1)
>
> is as direct as anything else for pre-processing. If need be you can
> of course insist that the prefix must be a prefix
>
> ... if substr(myvariable, 1, 3) == "+/-"
>
> The single character is char(177) in my flavour of Stata. Try
> -asciiplot- (SSC) to see if yours agrees
>
> subinstr(myvariable, char(177), "", .)
>
> is what I would try.
>
> I like -destring- too.
>
> Nick
>
> On Fri, Jun 8, 2012 at 7:12 PM, Richard Herron
> <[email protected]> wrote:
>
>> Thanks, David! That's big. I hadn't noticed the -ignore()- option in -destring-.
>>
>> But what if I don't know the set of possible prefixes? I guess
>> -destring- will throw an error and I iteratively improve my filter?
>>
>> I have some where +/- is almost like a LaTeX \pm symbol where the + is
>> stacked on the -. I think this is unicode U+00B1.
>> http://www.fileformat.info/info/unicode/char/b1/index.htm
>>
>> Can I use -destring- to -ignore()- these?
>
>> On Fri, Jun 8, 2012 at 1:59 PM, David Radwin <[email protected]> wrote:
>>> Can you use -destring- with the -ignore- option like this?
>>>
>>> . destring myvariable, ignore("+/-<>") generate(myvariable2)
>>>
>>> David
>>> --
>>> David Radwin
>>> Senior Research Associate
>>> MPR Associates, Inc.
>>> 2150 Shattuck Ave., Suite 800
>>> Berkeley, CA 94704
>>> Phone: 510-849-4942
>>> Fax: 510-849-0794
>>>
>>> www.mprinc.com
>>>
>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:owner-
>>>> [email protected]] On Behalf Of Richard Herron
>>>> Sent: Friday, June 08, 2012 10:30 AM
>>>> To: [email protected]
>>>> Subject: st: Remove prefixes (e.g., >, <, and +/-) from numbers stored as
>>>> strings
>>>>
>>>> I have numbers stored as string with prefixes (e.g., "+/-30") that I
>>>> would like to convert to numbers. Not all entries necessarily have
>>>> prefixes (or postfixes).
>>>>
>>>> With -regexm()- and -regexs()- I can remove postfixes and handle
>>>> decimals, but I can't remove prefixes. Can you spot my error with
>>>> -regexm()-? Thanks!
>>>>
>>>> Richard Herron
>>>>
>>>> * begin code
>>>> clear
>>>> set obs 20
>>>> generate number = 100*runiform()
>>>> generate prefix = ""
>>>> generate postfix = ""
>>>> foreach i of numlist 1 5 10 15 {
>>>>     replace prefix = ">" in `i'
>>>>     replace postfix = "%" in `=`i' + 1'
>>>>     replace number = int(number) in `=`i' + 2'
>>>> }
>>>> egen combo = concat(prefix number postfix)
>>>> generate number2 = regexs(1) if regexm(combo, "([0-9]*\.?[0-9]*)")
>>>> list
>>>> * end code
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/statalist/faq
>>> *   http://www.ats.ucla.edu/stat/stata/