Statalist The Stata Listserver



RE: st: Help with string problem


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: Help with string problem
Date   Fri, 25 Aug 2006 18:39:00 +0100

In addition, "space" for -egen, sieve()- means " " and 
doesn't include any characters that just print as spaces because 
they're otherwise unprintable. 
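
If the offending character is known, it can also be removed directly. A minimal sketch, assuming it is byte 160 (the HEX(A0) mentioned in the quoted message below) and that the variable is called -badid-; the names -cleanid- and -nremoved- are made up for illustration:

```stata
* assumption: the unwanted character is byte 160 (hex A0)
gen cleanid = subinstr(badid, char(160), "", .)

* count how many characters were removed from each value
gen nremoved = length(badid) - length(cleanid)
tab nremoved
```

This only helps when you already know which byte is causing trouble. In Stata 14 or later, where strings are Unicode, the same character may be stored differently, in which case -uchar(160)- may be needed in place of -char(160)-.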

As always, you may be better off knitting a solution 
to exactly the size of your problem. 

Given a string variable -badid-, let's suppose we regard 
legitimate characters to be a-z A-Z 0-9. 

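* build goodid one character at a time, keeping only a-z, A-Z, 0-9
gen goodid = ""
gen length = length(badid)
su length, meanonly

* r(max) is the longest length of badid, so the loop visits every position
qui forval i = 1/`r(max)' {
	replace goodid = goodid + substr(badid,`i',1) ///
		if inrange(substr(badid,`i',1),"a","z") | ///
		inrange(substr(badid,`i',1),"A","Z") |    ///
		inrange(substr(badid,`i',1),"0","9")
}

* the helper variable is no longer needed
drop length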

The recipe appears fairly general: just tune the -if- 
condition. My guess is that 
stuff you want to keep will always be printable and 
fall into a few small classes. 
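
As one possible tuning, here is a sketch that keeps uppercase letters, digits and hyphens only. It assumes the same string variable -badid-; the names -goodid2- and -len- and the local macro -c- are made up for illustration:

```stata
gen goodid2 = ""
gen len = length(badid)
su len, meanonly

qui forval i = 1/`r(max)' {
	* c holds the expression for the i-th character, not its value
	local c substr(badid,`i',1)
	replace goodid2 = goodid2 + `c' ///
		if inrange(`c',"A","Z") | inrange(`c',"0","9") | `c' == "-"
}

drop len
```

Past the end of a shorter string, -substr()- returns an empty string, which satisfies none of the conditions, so nothing is appended for those positions.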

Two small morals are that we do not need to fool around 
with -char()- or its elusive inverse -ascii()-, 
and that -inrange()- applies to strings too. 

. di inrange("Bush","Lincoln","Roosevelt")
0

Isn't Stata well-informed as well as smart? 
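
A few more checks on -inrange()- with strings, written as a sketch for a do-file. Comparison is by character code, so uppercase letters sort before lowercase ones:

```stata
di inrange("Lincoln","Bush","Roosevelt")   // 1: "Bush" <= "Lincoln" <= "Roosevelt"
di inrange("b","A","Z")                    // 0: lowercase "b" sorts after "Z"
di inrange("7","0","9")                    // 1: digits compare as characters
```

That ordering is why the loop above tests a-z and A-Z as two separate ranges.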

Nick 
[email protected] 

Nick Cox
 
> -omit(space)- confuses syntaxes and will not 
> do what you think it will. It omits "s", "p", etc. 

Fred Wolfe
  
> > That is a great egen. But it doesn't seem to work completely to omit
> > HEX(A0), unless I have done something wrong. Always likely.
> > 
> > 
> > . use fwbids,clear
> > . egen apatkey2 = sieve(apatkey),  keep(a n o)
> > . gen l1 = length(apatkey)
> > . gen l2 = length(apatkey2)
> > 
> > . egen apatkey3 = sieve(apatkey2),  omit(space)
> > . gen l3 = length(apatkey3)
> > 
> > . egen apatkey4 = sieve(apatkey3),  keep(a n)
> > . gen l4 = length(apatkey4)
> > 
> >        +---------------------------------------------------------------------------------------+
> >        |      apatkey   greger       apatkey2   l1   l2       apatkey3   l3      apatkey4   l4 |
> >        |---------------------------------------------------------------------------------------|
> >     1. | ABI000000-01        1   ABI000000-01   12   12   ABI000000-01   12   ABI00000001   11 |
> >     2. | AHR000000           1    AHR000000     12   11    AHR000000     11     AHR000000    9 |
> >     3. | AHR360227           1    AHR360227     12   11    AHR360227     11     AHR360227    9 |
> >     4. | ALB431118           1    ALB431118     12   11    ALB431118     11     ALB431118    9 |
> >     5. | ALD771122           1    ALD771122     12   11    ALD771122     11     ALD771122    9 |
> >        |---------------------------------------------------------------------------------------|
> > 
> > 
> > 
> > 
> > At 10:13 AM 8/25/2006, Nick Cox wrote:
> > >"you" here presumably meaning Fred's collaborators.
> > >
> > >There is a home-grown -egen- function called -sieve()-
> > >in -egenmore- from SSC that could be used to keep
> > >alphanumeric characters only.
> > >
> > >Nick
> > >[email protected]
> > >
> > >Rafal Raciborski
> > >
> > > > you could also use the clean() function in excel first, which removes
> > > > all nonprintable characters, before pasting into stata.



