



st: Cannot -file write- a line that was just successfully -file read-


From   David Elliott <dcelliott@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: Cannot -file write- a line that was just successfully -file read-
Date   Thu, 24 Jun 2010 10:28:12 -0300

I have two questions to the list with the supporting documentation
following them:

(1) What are the possible reasons for being able to -file read- a line
that one cannot -file write-, and what additional processing of the
line macro might I do to avoid the error?

(2) I'd also like advice about whether it would be worthwhile writing
the file IO part of the routine below in mata for speed reasons, or
possibly to avoid problems with un-read/writable lines arising from
the presence of characters that cause problems in normal macro
processing.

Context:
I am undertaking a major rewrite of my large file chunking utility
-chunky- (-ssc describe chunky- for info) after a user encountered
problems with the routine halting at an unreadable/unwritable line in
a raw datafile he was chunking.

Approach:
The core of the chunking routine is a loop that -file read-s a
source file line by line and -file write-s each line to a destination
file.  The current -chunky- routine has no error trapping in this
loop and would abort for no apparent reason.  I have introduced
-capture-s into the read and write steps as follows:
=======code excerpt begin=======
forvalues r = 1/`lines' { // Move pointer to index line
  capture file read `in' line
  local rc = _rc // keep the return code from the read step
  if `r(eof)' == 1 { // end of file
    n di _n "{err:Terminating at end of file}"
    local eof 1
    continue, break
  }
  local ++index // increment infile line counter
  if `rc' != 0 { // the read itself failed
    n di _n "{err:chunky encountered unreadable data at {txt:file index: }{res: `index'}}" _n ///
      "{err:debug info: {txt:r(eof) = }{res:`r(eof)'}  {txt:r(status) = }{res:`r(status)'}}"
  }
  capture file write `out' `"`macval(line)'"' _n
  if _rc != 0 { // the write failed even though the read succeeded
    n di _n "{err:chunky encountered unwritable line at file index }{res: `index'}"
  }
}
=======code excerpt end=======
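
To narrow down why a write fails, one option is to report the return code
and the length of the offending line at the point of failure.  The
following is only a minimal stand-alone sketch, not code from -chunky-:
the handle `out', the temporary file, the sample line, and the index
value are all made up for illustration.

```stata
* Hypothetical stand-alone test: write one line, report details on failure
tempname out
tempfile dest
file open `out' using `"`dest'"', write text replace

* Illustrative content only; a real run would use a line from -file read-
local line `"sample line with "embedded" quotes"'
local index 1

capture file write `out' `"`macval(line)'"' _n
if _rc != 0 {
    * keep the return code for reporting
    local wrc = _rc
    * length of the macro contents
    local len : length local line
    display as error "write failed at file index `index': _rc = `wrc', length = `len'"
}
file close `out'
```

Knowing the return code and line length for each failing index would at
least separate length-related failures from ones caused by the content
of the line.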

Here are some lines from the user's output log generated while
chunking a 10 GB raw data dump (note: this is from a rewritten version
of -chunky-, called -chunky_new- for testing purposes, and the syntax is
new as well.  Each dot indicates successful completion of a chunk.):
. chunky_new using "X:\directory obscured\really really big file.TXT",
header(include) lines(500000) stub(chunk) replace

chunking ..
chunky encountered unwritable line at file index  688704
chunky encountered unwritable line at file index  770579
chunky encountered unwritable line at file index  998863
.
chunky encountered unwritable line at file index  1321586
...

What I find curious is that the offending line -file read-s OK, but an
error is captured at the -file write- step.  I have thoroughly read
the -file- reference in online and manual documentation.  The use of a
compound double quoted `"`macval(line)'"' is the accepted way of
outputting exactly what was read into the local macro line.  -file
read- saves two results, r(eof) and r(status), after every read
(plus, of course, the return code _rc recorded by -capture-), and I have
debugging code to indicate whether one of the trappable conditions occurs,
such as a line that is too long or unterminated.  Encountering an end of
file (r(eof)==1) simply breaks out of the loop.
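
For question (2), a Mata version of the copy loop might look roughly
like the sketch below.  It is an untested illustration under stated
assumptions, not part of -chunky-: the function name copy_lines() and
its arguments are hypothetical.  It relies on Mata's fopen(), fget(),
fput() and fclose(); each line is held in a Mata string rather than a
local macro, so the data never passes through macro expansion.
Whether that avoids the problem lines, or is faster, remains to be tested.

```stata
mata:
// Hypothetical helper: copy up to n lines from src to dst.
// Returns the number of lines actually copied.
// Note: fopen() with mode "w" requires that dst does not already exist.
real scalar copy_lines(string scalar src, string scalar dst, real scalar n)
{
    real scalar   fin, fout, i
    string matrix line

    fin  = fopen(src, "r")
    fout = fopen(dst, "w")
    for (i = 0; i < n; i++) {
        // fget() returns J(0,0,"") at end of file
        if ((line = fget(fin)) == J(0, 0, "")) break
        fput(fout, line)
    }
    fclose(fin)
    fclose(fout)
    return(i)
}
end
```

A call such as -mata: copy_lines("source.txt", "chunk0001.txt", 500000)-
(file names invented here) would copy the first 500,000 lines, or fewer
if the source ends sooner.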

Thank you.

(Incidentally, when this problem is solved, I will be replacing the
current version of chunky.ado that is on SSC with this one.  Anyone
with large file chunking/splitting needs who wishes to beta test this
new routine should contact me off-list.

Improvements in the beta version include:
* Significant speed increase
* Much simpler setup through reconceptualizing the actions of the routine
* Better syntax and error checking
* A peek(n) option to look at the first n lines of a file
* An analyze option to estimate the number of lines to read for
various chunk and Stata filesizes as well as check for potential
problems arising from extended ASCII characters )

--
David Elliott

Everything is theoretically impossible, until it is done.
Progress is made by lazy men looking for easier ways to do things.
-- Robert A. Heinlein (American science-fiction Writer, 1907-1988)
