
st: RE: Suggestions on learning Perl


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: Suggestions on learning Perl
Date   Wed, 27 Nov 2002 12:57:22 -0000

An answer may still be of interest, despite 
Frank's later posting that he is going to grapple 
with Python. 

The best books on Perl almost all come from O'Reilly: see 

http://perl.oreilly.com/

The original "Programming Perl" and one of several later 
books "Learning Perl" are both now in their 3rd editions. 

Perl is a wonderful thing. One problem now, however, 
is that just identifying the subset which is going to 
be of most use to you can take a fair while, because 
the whole language and all sorts of user-written 
add-ons are in total very large (sound familiar?). 

I have great admiration for Perl, and have 
used it a bit, but I have an offbeat suggestion
which real Perl experts will sniff at. 
They will say that what I am going to suggest 
has long since been superseded by Perl -- and they 
are in a sense totally right. 

Awk. 

There are two great advantages to Awk: 

1. It is a small, compact language. 
The original book by Aho, Kernighan 
and Weinberger (permute AKW) is 
still in print, very slim and very, 
very good. Awk is a language which can be 
learned quickly. 

2. It has a very narrow view of the 
world: its mindset is that it expects 
to be looking at a text file line by 
line. It can be made to do other things
fairly easily, but it is built largely 
for that purpose -- which for Stata users 
is of course very often exactly what you want.  
Data files, program files, log files: all 
have a definite line-based structure. 
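
To give the flavour (inventing a small example here, 
with a made-up filename): a pattern standing alone, 
with no action, simply prints the records it matches, 
so 

awk " NF > 0 " weather.raw 

prints only the non-blank lines of weather.raw, and 

awk " END { print NR } " weather.raw 

prints the total number of records, much as wc -l 
would. 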

Yesterday before Frank's posting 
I had this problem: 

Meteorological observations here 
at Durham have been made for >150 years 
but over that time there have been many 
changes in what has been measured. 
The data come as a series of annual ASCII files 
1849.dat, 1850.dat, ... with a series 
of Stata dictionaries saying which 
variables were measured in what years. 
My starting point for one analysis is a 
.do file reading in from each .dat file and ending with a 
-save- to an annual .dta file. 

The end point is one loop 

use 1849 
forval i = 1850/1997 { 
	append using `i' 
} 

which (eventually) took the blink of an eye. 

But in the middle there were 
problems. With one year there was 
a stream of error messages which 
implied that Stata was seeing 
fewer data items than it expected. 

A call to Awk something like 

awk " { print NF } " 1859.dat 

printed 19 again and again in a long stream 
as the number of fields in each record, 
so that was true of 
at least most of the lines in the file; 
except that a test 

awk " NF != 19 { print NR, NF } " 1859.dat

revealed a line with 18 fields: what 
should have been an explicit missing 
was in fact a blank, with knock-on 
effects throughout the rest of the file. 
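
(To invent a slightly fancier one-liner: rather than 
eyeballing a long stream of 19s, you can tabulate the 
field counts directly, since Awk arrays are associative 
and need no declaration: 

awk " { count[NF]++ } END { for (i in count) print i, count[i] } " 1859.dat 

prints one line per distinct field count together with 
its frequency. Similarly 

awk " NF != 19 { print } " 1859.dat 

prints each offending record in full, as -print- with 
no arguments prints the whole record.) 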

What is going on here? 

1. Awk is looking at each record 
(by default a line) in the file specified. 

2. " { print NF } " 

is a complete Awk program. It has the form 

" { <action> } " 

and an <action> like this is automatically 
executed for each record in the file. 

3. NF is an example of a built-in 
variable, and gives the number 
of fields (by default fields 
are separated by white space). 

4. " NF != 19 { print NR, NF } "

is a complete Awk program. It has the form 

" <pattern> { <action> } " 

and <action> is executed if and only 
if <pattern> is satisfied by a record. In this 
case, if the number of fields is not 19, 
the program prints the number of the record 
(NR is another built-in variable) and the number of 
fields on that record. 
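
5. (A step beyond what I actually typed.) END is one 
more special pattern, matched after the last record 
has been read; its companion BEGIN is matched before 
any input. So a one-pass summary might be 

awk " NF != 19 { bad++ ; print NR, NF } END { print NR, bad+0 } " 1859.dat 

which lists each malformed record and finishes by 
printing the total number of records and the number 
malformed. The bad+0 forces a numeric 0 when no bad 
records are found, as an unset Awk variable would 
otherwise print as an empty string. 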

What's important here is not the details -- nor the 
fact that there are other ways to tackle this, 
including all-Stata solutions -- but the notion that 
programs can be written on the fly both easily 
and effectively. 

Some other uses of Awk with Stata were written up as 

STB-19 os13. Using awk and fgrep for selective 
extraction from log files. 5/94, pp. 15--17; 
STB Reprints Vol. 4, pp. 78--80. 

(Explains how to use awk to selectively extract comments 
from log files, and how to use fgrep to selectively 
extract lines from log files.) 

An unorthodox introduction to an unorthodox 
language is included in 

A conversation on Awk. Computers & Geosciences 
21, 1-6 (and 1119) (1995) 

although, despite my best efforts, some typos 
appear in the text. 

Nick 
[email protected] 




