



Re: st: Cannot -file write- a line that was just successfully -file read-


From   "James Beard" <james@beard.net>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Cannot -file write- a line that was just successfully -file read-
Date   Thu, 01 Jul 2010 12:43:17 -0000

One possible cause of your problem is that 

	file write `out' `"`macval(line)'"' _n

will fail if the macro line contains the backquote character -`- 
(despite what the documentation says about quoting strings).

Moving to Mata should get round this (and make your code faster).
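
As a rough illustration only (this is not code from -chunky-), a
line-by-line copy in Mata might look like the sketch below. The function
name copy_lines() and its arguments are made up for this example;
fopen(), fget(), fput(), fclose() and unlink() are standard Mata
functions. Each line is held in a Mata string variable rather than a
local macro, so no macro quoting is involved.

```stata
mata:
// Hypothetical helper: copy up to maxlines lines from infile to outfile
// and return the number of lines actually copied.
real scalar copy_lines(string scalar infile,
                       string scalar outfile,
                       real scalar maxlines)
{
    real scalar   fin, fout, n
    string matrix line      // matrix, because fget() returns J(0,0,"") at EOF

    unlink(outfile)         // fopen(..., "w") requires that the file not exist
    fin  = fopen(infile, "r")
    fout = fopen(outfile, "w")

    n = 0
    while (n < maxlines) {
        line = fget(fin)             // one line, without its terminator
        if (rows(line) == 0) break   // end of file reached
        fput(fout, line)             // writes the line plus a line terminator
        n++
    }

    fclose(fin)
    fclose(fout)
    return(n)
}
end
```

With placeholder filenames, it could be called as
-mata: copy_lines("source.txt", "chunk0001.txt", 500000)-. Note that this
sketch always reads from the start of the source file; a real chunker
would keep the input handle open between chunks.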

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

I have two questions for the list, with supporting documentation
following them:

(1) What are the possible reasons for being able to -file read- a 
line that one cannot -file write-, and what additional processing of 
the line macro might I do to avoid the error?

(2) I'd also like advice about whether it would be worthwhile writing
the file I/O part of the routine below in Mata, either for speed or
to avoid problems with unreadable/unwritable lines arising from
characters that cause trouble in normal macro processing.

Context:
I am undertaking a major rewrite of my large file chunking utility
-chunky- (-ssc describe chunky- for info) after a user encountered
problems with the routine halting at an unreadable/unwritable line in
a raw datafile he was chunking.

Approach:
The core of the chunking routine is a loop that -file read-s a
source file line by line and -file write-s each line to a destination
file.  The current -chunky- routine has no error trapping in this
loop and would abort for no apparent reason.  I have introduced
-capture-s into the read and write steps as follows:
=======code excerpt begin=======
forvalues r = 1/`lines' { // Move pointer to index line
  capture file read `in' line
  local rc = _rc // save the read return code for the check below
  if `r(eof)' == 1 { //end of file
    n di _n "{err:Terminating at end of file}"
    local eof 1
    continue, break
  }
  local ++index // increment infile line counter
  if `rc' != 0 { // the read itself failed
    n di _n "{err:chunky encountered unreadable data at {txt:file index: }{res: `index'}}" _n ///
    "{err:debug info: {txt:r(eof) = }{res:`r(eof)'}  {txt:r(status) = }{res:`r(status)'}}"
  }
  capture file write `out' `"`macval(line)'"' _n
  if _rc != 0 { // the write failed even though the read succeeded
    n di _n "{err:chunky encountered unwritable line at file index }{res: `index'}"
  }
}
=======code excerpt end=======

Here are some lines from the user's output log, generated while 
chunking a 10 GB raw data dump.  (Note: this is from a rewritten version 
of -chunky- called -chunky_new- for testing purposes, and the syntax 
is new as well.  The dots .. indicate successful completion of a 
chunk.)

. chunky_new using "X:\directory obscured\really really big file.TXT", header(include) lines(500000) stub(chunk) replace

chunking ..
chunky encountered unwritable line at file index  688704
chunky encountered unwritable line at file index  770579
chunky encountered unwritable line at file index  998863
.
chunky encountered unwritable line at file index  1321586
...

What I find curious is that the offending line -file read-s OK, but 
an error is captured at the -file write- step.  I have thoroughly 
read the -file- reference in both the online and manual documentation.  
The use of a compound double-quoted `"`macval(line)'"' is the accepted 
way of outputting exactly what was read into the local macro line.  
-file read- saves two results, r(eof) and r(status), after every 
operation (plus, of course, the ubiquitous return code _rc), and 
I have debugging code to indicate whether one of the trappable errors 
occurs, such as a too-long or unterminated line.  Encountering an end 
of file (r(eof)==1) simply breaks out of the loop.

Thank you.

(Incidentally, when this problem is solved, I will be replacing the
current version of chunky.ado that is on SSC with this one.  Anyone
with large file chunking/splitting needs who wishes to beta test this
new routine should contact me off-list.

Improvements in the beta version include:
* Significant speed increase
* Much simpler setup through reconceptualizing the actions of the 
routine
* Better syntax and error checking
* A peek(n) option to look at the first n lines of a file
* An analyze option to estimate the number of lines to read for
various chunk and Stata filesizes as well as check for potential
problems arising from extended ASCII characters)

--
David Elliott

Everything is theoretically impossible, until it is done.
Progress is made by lazy men looking for easier ways to do things.
-- Robert A. Heinlein (American science-fiction Writer, 1907-1988)

