The Stata listserver

st: RE: Re: how to extract numeric part of a string


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Re: how to extract numeric part of a string
Date   Tue, 17 Dec 2002 12:02:44 -0000

Dale Plummer

> > If I have a string variable, is there a way to extract
> > only the number component?
> >
> > Examples:
> >
> > Var_source        Var_target
> > Abc123             123
> > Dog34              34
> > 1209               1209
> > cat                 .

Scott Merryman

> As a brute force method, you can use the destring command with the
> ignore option. However, you have to specify all the characters to be
> ignored.  E.g.

> . list

>          var1
>  1.    abc123
>  2.    ab1234
>  3.     abcd1
>  4.    a12345
>  5.    142332

> . destring var1 , gen(new) i(a b c d )
> var1: characters a b c d removed; new generated as long

> . list

>          var1         new
>  1.    abc123         123
>  2.    ab1234        1234
>  3.     abcd1           1
>  4.    a12345       12345
>  5.    142332      142332

Michael Blasnik

> I have needed this function and include below a small ado
> that does this
> (although you could grab just the loop and type it
> interactively instead):
>
> program define extrnum
> version 7
> syntax varlist(max=1) , gen(str)
> local maxlen: type `varlist'
> local maxlen=substr("`maxlen'",4,.)
> tempvar work
> qui gen str1 `work'=""
> forvalues i=1/`maxlen' {
>  qui replace `work'=`work'+substr(`varlist',`i',1) if real(substr(`varlist',`i',1))<.
> }
> gen `gen'=real(`work')
> end
>
> *Warning: you should check to make sure that email doesn't
> split the long
> line in the forvalues loop (there should be just a single
> command in the
> loop).
>
> for your example, the syntax would be:
>
> extrnum Var_source, gen(Var_target)
>
> You should be aware that this will simply pull together all
> numbers in the
> variable of interest, regardless of any intervening
> characters, so 12test34
> would become 1234.  Also, it will not extract negative
> signs, so all results
> will be positive.

Two tactics
===========

The replies from Scott and Michael illustrate
the two basic tactics here, to omit characters
you don't want and to select characters you
do want.

-destring-
==========

The "official Stata" answer to this problem would
certainly include a mention of -destring-.
I suppose that -destring- is optimistic
in assuming that at most you have a few
problematic characters which you can spell out
to -ignore()-.
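
As a minimal sketch of that route, applied to the data in the
original question: the ignore list below was written out by hand
from the four example values and would have to be extended for
any other letters present in real data.

```stata
* Rebuild the example data from the original question
clear
input str6 Var_source
"Abc123"
"Dog34"
"1209"
"cat"
end

* Every non-numeric character that occurs must be listed in ignore();
* this list covers only the letters in the four values above
destring Var_source, generate(Var_target) ignore("AbcDogat")

list
```

A value such as "cat" is left empty once its characters are
ignored, so it ends up as numeric missing.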

-egen, sieve()-
===============

On the unofficial side, the package -egenmore-
from SSC includes this function, which I think
owes something to a question offlist from Gerald
Wright:

sieve(strvar) , { keep(classes) | char(chars) | omit(chars) }
selects characters from strvar according to a specified criterion
and generates a new string variable containing only those characters.
This may be done in three ways. First, characters are classified using
the keywords alphabetic (any of a-z or A-Z), numeric (any of 0-9),
space or other. keep() specifies one or more of those classes:
keywords may be abbreviated by as little as one letter. Thus keep(a n)
selects alphabetic and numeric characters and omits spaces and other
characters. Note that keywords must be separated by spaces.
Alternatively, char() specifies each character to be selected or
omit() specifies each character to be omitted. Thus char(0123456789.)
selects numeric characters and the stop (presumably as decimal point);
omit(" ") strips spaces and omit(`"""') strips double quotes.
(Stata 7 required.)

Note that some users may want to regard "," as
a numeric character. Although not mentioned
in the extract from the help just given, the negative
sign could and in some cases should be specified
explicitly; alternatively

egen newvar = sieve(strvar), omit(a)

may suffice in some applications.
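
For the original question, a sketch along these lines might do,
assuming -egenmore- has been installed from SSC. The variable name
digits is made up for the illustration; keep(numeric) follows the
help extract above, and -real()- converts the sieved string to a
number.

```stata
* Assumes egenmore is installed:  ssc install egenmore
clear
input str6 Var_source
"Abc123"
"Dog34"
"1209"
"cat"
end

* Keep only the characters 0-9 in a new string variable
egen digits = sieve(Var_source), keep(numeric)

* Convert to numeric; an empty string becomes missing
gen Var_target = real(digits)

list
```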

More generally, I will mention that -egenmore- includes
several functions for working with strings.

Comment on Michael's -extrnum-
==============================

In addition to the negative sign question, Michael's program will omit
decimal points, commas or indeed any characters
other than 0-9. Michael's condition

if real(substr(`varlist',`i',1))<.

could be extended in a copy of his program, e.g. by

| index("-.", substr(`varlist',`i',1))

That is, the first argument to -index()- is the list of extra
characters to allow (here the minus sign and the decimal point).
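
Typed as a do-file rather than as a change to the program, the
extended loop could be sketched as follows. The variable names s,
work and target and the three test values are made up for the
illustration.

```stata
clear
input str8 s
"x-12.5"
"Abc123"
"cat"
end

* Maximum length from the storage type, as in Michael's program
local type : type s
local maxlen = substr("`type'", 4, .)

gen str1 work = ""
forvalues i = 1/`maxlen' {
    * keep a character if it is a digit, a minus sign or a decimal point
    quietly replace work = work + substr(s, `i', 1) ///
        if real(substr(s, `i', 1)) < . | index("-.", substr(s, `i', 1))
}
gen target = real(work)

list
```

Characters are still collected regardless of position, so a value
such as "12-3" would yield the string "12-3", which -real()- cannot
convert and so returns as missing.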

-charlist-
==========

Finally, this utility may be of use or interest:

=========== begin
program def charlist, rclass
*! NJC 1.0.0 17 Dec 2002
	version 7
	syntax varname(string) [if] [in]
	marksample touse, novarlist
	* not 0: see [P] file formats .dta
	forval i = 1/255 {
		capture assert index(`varlist', char(`i')) == 0 if `touse'
		if _rc {
			local c = char(`i')
			local charlist "`charlist'`c' "
			local numlist "`numlist'`i' "
		}
	}
	di as text "`charlist'"
	return local charlist "`charlist'"
	return local numlist "`numlist'"
end
========== end

Examples:

. gen str3 bar = "bar"

. charlist bar
a b r

. charlist make if !foreign
 - . 7 8 9 A B C D E F G H I L M N O P R S T V X Z a b c d e f g h i k l m n o p q r s t u v w x y z

. ret li

macros:
           r(numlist) : "32 45 46 55 56 57 65 66 67 68 69 70 71 72 73 76 77 78 79 80 82 83 84 86 88 90 97
> 98 99 100 101 102 103 104 105 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 "
          r(charlist) : " - . 7 8 9 A B C D E F G H I L M N O P R S T V X Z a b c d e f g h i k l m n o p
> q r s t u v w x y z "

Some possible uses of -charlist-:

1. See what characters are present in
a string variable. Non-printable characters
are decodable via -r(numlist)- and -char()-.

2. Copy and paste part of the character
set to -destring-'s -ignore()- option.
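
The second use can also be automated rather than done by copy and
paste. This is only a sketch: it assumes that the -charlist- program
above has already been defined, and that the variable contains no
spaces, quotes or other characters that would need special care
inside -ignore()-. The local macro names codes and ignore are made
up for the illustration.

```stata
* Assumes the charlist program above has been run or saved as charlist.ado
clear
input str6 Var_source
"Abc123"
"Dog34"
"1209"
"cat"
end

charlist Var_source
local codes "`r(numlist)'"

* Collect every character present that is not a digit (codes 48-57)
local ignore
foreach n of local codes {
    if !inrange(`n', 48, 57) {
        local ignore "`ignore'`=char(`n')'"
    }
}

destring Var_source, generate(Var_target) ignore("`ignore'")

list
```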

Perhaps the character list should not
be written out space-separated. Or
perhaps two versions should be emitted,
one space-separated, and the other not.
I welcome views.

Nick
n.j.cox@durham.ac.uk
