
st: RE: AW: RE: RE: RE: FW: Using regex to identify strings with capital letters

From	"Nick Cox"
Subject	st: RE: AW: RE: RE: RE: FW: Using regex to identify strings with capital letters
Date	Thu, 27 May 2010 12:58:36 +0100

The sort order for strings is, to the best of my knowledge, the order of
-char()-. -asciiplot- from SSC is one of various ways to make it
graphic. 

Nick 
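As a small illustration of the point about -char()- order, the sketch below checks a few comparisons directly. It assumes plain ASCII strings and is meant to be run from a do-file; the example strings are chosen for illustration only.

```stata
* Capital letters are char(65) to char(90), lowercase char(97) to char(122)
display char(65) char(90) char(97) char(122)

* -assert- is silent when the condition holds and stops with an error otherwise
assert "Z" < "a"
assert "AA" < "Aa"

* Strings are compared character by character, so the first character
* usually decides the outcome
display inrange("Er", "AA", "ZZ")
display inrange("er", "AA", "ZZ")
```

The last two lines display 1 and 0 respectively, which matches the results reported elsewhere in this thread.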

Martin Weiss

So Erik can play around with the example code I created last night to
explain the -inrange()- thing to myself. Strangely, I did not find a
really good source on the -sort- order for strings, in particular with
regard to upper and lower case issues. The index points to [U] 13.2.3,
which is a little terse.

*************
clear*

//from NJC's http://www.stata-journal.com/sjpdf.html?articlenum=pr0013
//follow special sequences


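//create dataset: 104 obs = 26 letters x 4 case patterns
set obs 104

gen str2 myvar=""

//tokenize A-Z into the positional macros 1 to 26
token `c(ALPHA)'

//obs 1-26 get AA..ZZ; obs 27-52 and 53-78 get a single capital for now
forv i=1/26{
	replace myvar="``i''"+"``i''" in `i'
	replace myvar="``i''" in `=`i'+26'
	replace myvar="``i''" in `=`i'+52'
}

//tokenize a-z into the positional macros 1 to 26
token `c(alpha)'

//obs 27-52 become Aa..Zz, obs 53-78 aA..zZ, obs 79-104 aa..zz
forv i=1/26{
	replace myvar=myvar+"``i''" in `=`i'+26'
	replace myvar="``i''"+myvar in `=`i'+52'
	replace myvar="``i''"+"``i''" in `=`i'+78'
}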

//try -inrange()-
l if inrange(myvar,"aA","zZ")
l if inrange(myvar,"AA","ZZ")

//try -regexm()-
l if regexm(myvar, "^([A-Z][A-Z])")
l if regexm(myvar, "^([A-Z][a-z])")
l if regexm(myvar, "^([a-z][A-Z])")
l if regexm(myvar, "^([a-z][a-z])")
*************

Nick Cox

Martin is correct that -inrange("Er", "AA", "ZZ")- is true. Possibly
this is Erik's specific problem, namely that having the first letter be
a capital in "A" ... "Z" is necessary but not sufficient. 

I offer as a stronger criterion 

inrange(substr(myvar,1,1), "A", "Z") & inrange(substr(myvar,2,1), "A", "Z")

I continue to like regex solutions when they are the simplest available!
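To make the difference between the two -inrange()- criteria concrete, here is a minimal sketch on a four-observation toy dataset. The test strings and the variable names loose and strict are made up for illustration; only myvar comes from the example code in this thread.

```stata
clear
input str20 myvar
"VA department"
"Er"
"eR"
"er"
end

* Criterion on the first two characters taken together
gen byte loose = inrange(substr(myvar,1,2), "AA", "ZZ")

* Stronger criterion: test each of the first two characters separately
gen byte strict = inrange(substr(myvar,1,1), "A", "Z") & inrange(substr(myvar,2,1), "A", "Z")

list
```

Here "Er" passes the first test but fails the second, because "r" sorts after "Z". Only "VA department" passes both.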


Nick 

Martin Weiss

Erik does have a point, though, in that Nick's -inrange()- proposal
seems to check for the first character only:

***********
di inrange(substr("erik in lower case",1,2) , "AA", "ZZ")
di inrange(substr("Erik in lower case",1,2) , "AA", "ZZ")
***********

BTW, why was -di inrange("erik in lower case", "AA", "ZZ")- a good
example earlier, even though the -substr()- part was missing?

Nick Cox

Not true of my Stata: 

. di inrange("erik in lower case", "AA", "ZZ")
0

I think -- you've heard this before -- we need to see your code and some
of your results, not your speculation about what might be happening. 

Nick 

Beecroft, Erik (VDSS)

I tried Nick's suggestion, pasted below, but inrange does not seem to
distinguish between lower and upper case.  In other words, the statement
below keeps all observations that begin with two letters, whether
capital or lower case.

Nick Cox 

You don't need regex for this. 

... if inrange(substr(myvar,1,2), "AA", "ZZ") 

should be enough, or even "AK" to "WY" or whatever it is. (Remember this
is an international list!) 

From: Beecroft, Erik (VDSS) 

I need to extract certain observations from a series of text files.
Each file contains only one variable, which is string.  The
observations I want all begin with two capital letters. (They are state
abbreviations, such as VA or AK).  The other observations do not begin
with two capital letters.

Is there a way to tell Stata to keep only observations for which the
variable begins with two capital letters?

It seems like the regex function might work, but I have never worked
with regular expression syntax before.  

For example, a portion of a text file might look like:
	text1
	text2
	VA department of Social Services
	text4
	text5

I want to keep only the third observation above.
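For reference, the request above can be sketched with -regexm()- on the sample lines as follows. The variable name myvar is an assumption borrowed from the example code in this thread; the actual variable name in the text files is not given.

```stata
clear
input str40 myvar
"text1"
"text2"
"VA department of Social Services"
"text4"
"text5"
end

* ^ anchors the match at the start of the string;
* [A-Z][A-Z] then requires two capital letters in a row
keep if regexm(myvar, "^[A-Z][A-Z]")

list
```

Only the third of the five sample lines is kept.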

I am using Stata 10.1 for Windows.

