Statalist The Stata Listserver


Re: st: Infile stops reading dataset


From   [email protected] (William Gould, Stata)
To   [email protected]
Subject   Re: st: Infile stops reading dataset
Date   Wed, 28 Mar 2007 07:49:25 -0500

Evan Roberts <[email protected]> wrote:

> Has anyone ever had infile stop reading a dataset part way through a 
> file for no apparent reason?
> 
> I have tested the dictionaries and the data reads in correctly for the 
> first 3534 lines of the file, and then stops (there are 778,429 lines in 
> the file -- it is hierarchical).

Pleasingly, Evan made available shortened versions of his data on the web.  I
have solved the problem, I think.  The file contains 0xff, which Stata is
mistakenly taking as an end-of-file.  I am recording that as a bug in 
Stata.  In the meantime, I have a workaround for Evan.


Solution
--------

In order to get the data to read, I typed 

        . filefilter no1875a.dat new.dat, from(\FFh) to(X)

That Stata command created a new file new.dat containing the original, 
but changing all 0xff characters to capital X.  
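
For readers who want to make the same substitution outside Stata, here is a
minimal Python sketch.  It is only an illustration of the byte-for-byte
replacement described above, not a description of how -filefilter- works
internally.  The function name replace_byte is made up for this example, and
the sketch assumes the file is small enough to read into memory at once.

```python
def replace_byte(src_path, dst_path, old=0xFF, new=ord("X")):
    """Copy src_path to dst_path, changing every `old` byte to `new`.

    Hypothetical helper for illustration only.
    Returns the number of bytes that were changed.
    """
    # Binary mode: no character decoding and no newline translation.
    with open(src_path, "rb") as src:
        data = src.read()
    count = data.count(bytes([old]))
    with open(dst_path, "wb") as dst:
        dst.write(data.replace(bytes([old]), bytes([new])))
    return count
```

Calling replace_byte("no1875a.dat", "new.dat") would write new.dat with each
0xff changed to X and return how many bytes were changed.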

Before I did that, when I read the household part of no1875a.dat, -infile-
reported 691 observations.  After making the change and then reading new.dat,
-infile- reported 1841 observations.  Moreover, when I looked at a hexdump of
the original file, the first 0xff occurred on the line following the line at
which Stata stopped reading.  Thus, I am reasonably sure I have found the
problem.
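
The check on where the first 0xff falls can also be sketched outside Stata.
The Python below is an assumption-laden illustration, not the procedure used
above: the name first_line_with_byte is hypothetical, and lines are taken to
end in a linefeed (Unix or Windows line endings).

```python
def first_line_with_byte(path, target=0xFF):
    """Return the 1-based number of the first line containing `target`.

    Hypothetical helper; returns None if the byte never occurs.
    Lines are split on b"\\n", so a file that uses a bare carriage
    return as its line ending would be seen as a single line.
    """
    needle = bytes([target])
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            if needle in line:
                return lineno
    return None
```

If the result is one more than the number of lines a reader got through
before stopping, that points at the byte as the culprit.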



0xff?  What's that?
-------------------

0xff is computer jargon for a character that has all bits on.  These days, 
0xff is just another character, no different from "a" (0x61) or "z" 
(0x7a).  I believe that in no1875a.dat, 0xff is supposed to represent 
a y with an umlaut over it.
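
As a small illustration of the codes mentioned above, the Python lines below
show the bit pattern of 0xff and the character it maps to.  The mapping to
y-umlaut assumes the file is encoded in ISO 8859-1 (Latin-1); the encoding
of the file is not actually known.

```python
b = 0xFF

bits = format(b, "08b")                    # '11111111': all eight bits on
value = int(bits, 2)                       # 255

# Assumption: the file uses ISO 8859-1 (Latin-1), where byte 0xff is U+00FF.
as_latin1 = bytes([b]).decode("latin-1")   # 'ÿ', y with an umlaut (diaeresis)

code_a = hex(ord("a"))                     # '0x61'
code_z = hex(ord("z"))                     # '0x7a'
```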

Anyway, in the early days of Stata, an operating system called MSDOS treated
0xff as the end-of-file marker, and Stata adopted that rule.  I thought that
we had removed the last remnants of 0xff-means-end-of-file years ago, but
evidently not.

I chose to change 0xff to "X" because "X" never appeared in the file.
Once the data is read, Evan can change "X" back to y-umlaut.


How I discovered the problem
----------------------------

Anytime you have difficulty reading a file, try out -hexdump, analyze-.
I typed 

        . hexdump no1875a.dat, analyze 

and saw that -hexdump- flagged the file as binary.  Everything -hexdump- 
reported looked reasonable, except it mentioned that there were 3 
"Extended Control Characters", and I knew that was odd.  Next, I typed 

        . hexdump no1875a.dat, tabulate 

which gave me a tabulation of every character in the file.  The 0xff 
jumped out at me, although there would be no reason why it should jump out 
at anybody else.  As I said, these days, 0xff is just another character, 
but I knew Stata's history.

It was from -hexdump no1875a.dat, tabulate- that I learned "X" was never 
used.  So I used -filefilter- to change 0xff to X and then tried to 
read the data.  It worked.
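
The two facts taken from the tabulation (0xff occurs in the file, "X" does
not) can be reproduced outside Stata with a byte count.  This Python sketch
is not -hexdump-; the names tabulate_bytes and unused_printable are
hypothetical, and the whole file is read into memory.

```python
from collections import Counter


def tabulate_bytes(path):
    """Return a Counter mapping each byte value (0-255) to its frequency."""
    with open(path, "rb") as f:
        return Counter(f.read())


def unused_printable(counts):
    """List printable ASCII characters (0x21-0x7e) that never occur.

    Any of these is a safe stand-in for a troublesome byte, because it
    can be changed back later without ambiguity.
    """
    return [chr(b) for b in range(0x21, 0x7F) if counts[b] == 0]
```

With counts = tabulate_bytes("no1875a.dat"), counts[0xFF] gives the number
of 0xff bytes and unused_printable(counts) lists candidate replacement
characters.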

-- Bill
[email protected]