Re: st: Infile stops reading dataset
Evan Roberts <firstname.lastname@example.org>
> Has anyone ever had infile stop reading a dataset part way through a
> file for no apparent reason?
> I have tested the dictionaries and the data reads in correctly for the
> first 3534 lines of the file, and then stops (there are 778,429 lines in
> the file -- it is hierarchical).
Pleasingly, Evan made available shortened versions of his data on the web. I
have solved the problem, I think. The file contains 0xff, which Stata is
mistakenly taking as an end-of-file. I am recording that as a bug in
Stata. In the meantime, I have a workaround for Evan.
In order to get the data to read, I typed
. filefilter no1875a.dat new.dat, from(\FFh) to(X)
That Stata command created a new file new.dat containing the original,
but changing all 0xff characters to capital X.
Before I did that, when I read the household part of no1875a.dat, -infile-
reported 691 observations. After making the change and then reading new.dat,
-infile- reported 1841 observations. Moreover, when I looked at a hexdump of
the original file, the first 0xff occurred on the line following the line at
which Stata stopped reading. Thus, I am reasonably sure I have found the
problem.
0xff? What's that?
0xff is computer jargon for a character that has all bits on. These days,
0xff is just another character, no different from "a" (0x61) or "z"
(0x7a). I believe that in no1875a.dat, 0xff is supposed to represent
a y with an umlaut over it.
Anyway, in the early days of Stata, an operating system called MS-DOS treated
0xff as the end-of-file marker, and Stata adopted that rule. I thought that
we had removed the last remnants of 0xff-means-end-of-file years ago, but
apparently we had not.
I chose to change 0xff to "X" because "X" never appeared in the file.
Once the data is read, Evan can change "X" back to y-umlaut.
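That last step could be sketched roughly as follows. This is an illustration only: the variable name surname is hypothetical (the real names come from Evan's dictionary), and it assumes a Stata release in which char(255) returns the single byte 0xff. In Unicode-aware releases (Stata 14 and later), uchar(255) would be the function to try instead.

```stata
* Hypothetical variable name: repeat for each string variable read by -infile-.
* The final argument . asks subinstr() to replace every occurrence of "X".
replace surname = subinstr(surname, "X", char(255), .)
```

Because "X" never appeared in the original file, every "X" in the data after reading must have started life as 0xff, so the substitution is safe to reverse.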
How I discovered the problem
Anytime you have difficulty reading a file, try out -hexdump, analyze-.
I typed
. hexdump no1875a.dat, analyze
and saw that -hexdump- flagged the file as binary. Everything -hexdump-
reported looked reasonable, except it mentioned that there were 3
"Extended Control Characters", and I knew that was odd. Next, I typed
. hexdump no1875a.dat, tabulate
which gave me a tabulation of every character in the file. The 0xff
jumped out at me, although there would be no reason why it should jump out
at anybody else. As I said, these days, 0xff is just another character,
but I knew Stata's history.
It was from -hexdump no1875a.dat, tabulate- that I learned "X" was never
used. So I used -filefilter- to change 0xff to X and then tried to
read the data. It worked.
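Put together as a do-file, the whole sequence might look like the sketch below. The dictionary name household.dct is an assumption made for illustration, and the -infile- line assumes the dictionary does not itself name a data file; substitute the actual dictionary and adjust as needed.

```stata
* 1. Summarize the file's contents and watch for anything unexpected,
*    such as extended control characters
hexdump no1875a.dat, analyze

* 2. Tabulate every character: look for 0xff, and for a character
*    that is never used and so can serve as a stand-in
hexdump no1875a.dat, tabulate

* 3. Copy the file, changing each 0xff to the unused character X
filefilter no1875a.dat new.dat, from(\FFh) to(X)

* 4. Read the filtered copy (household.dct is a hypothetical name)
infile using household.dct, using(new.dat) clear
```

Comparing the observation count reported in step 4 with the count obtained from the unfiltered file is a quick check that the stray character was the cause.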