Re: st: Error 612 on .dta in Stata 13.1
"William Gould, StataCorp LP" <[email protected]>
[email protected]
Mon, 09 Dec 2013 14:07:01 -0600
Sergiy Radyakin <[email protected]> reports having two old .dta
files that Stata 11 and 12 can -use- without problem, but that StataMP
13.1 refuses to read, instead saying
. use "datafile.dta", clear
.dta file corrupt
The file unexpectedly ended before it should have.
Sergiy is looking for advice and cannot share the data files.
Sergiy used -hexdump- or something on the files and reports that they
are specification 114, meaning they are from Stata 10.
Why can Stata 11 and 12 read the data, but not Stata 13?
Stata 13 is far more demanding that .dta files match the expected
format than any previous version of Stata. We changed the code and we
changed the file format so that Stata could better determine when a
problem arose.
These files are in an old format, so Stata 13 is more limited in the
kinds of problems it can detect, but the code is still more demanding.
That is why Stata 13 cannot read the files but Stata 11 and 12 can.
An assumption I am making
Sergiy can read the data using a previous version of Stata, he says. I
am assuming that, using the OLD Stata, if Sergiy types
. use <originaldataset>
. save copy
and then if Sergiy switches to Stata 13 and types
. use copy
the dataset loads without error. If that is not true, then either
there is a bug in Stata 13 or the original dataset is corrupt, and
just reading the corrupted dataset corrupted the OLD Stata session.
At that point, Sergiy needs to talk to us, because we will want to
determine which is the case. We can sign nondisclosure forms.
How to determine how serious the error is
Let's assume that using and saving the original data with the OLD Stata
results in a dataset Stata 13 can read.
Let me outline the process we would follow if Sergiy could send us the
data.
1. In Stata 13, type -help dta-. Click on "114".
Unfortunately, when I did that, I discovered a minor error in
our help file. Further down, the file talks about "115"
datasets even though I had clicked on 114.
Do not panic. The 114 and 115 formats are identical. They
differ only in that format-115 datasets might contain %tb formats
for date variables, whereas format-114 datasets cannot.
2. First, I want Sergiy to use -hexdump- to obtain the header.
In Stata 13, type
. set more on
. log using <whatever>
. hexdump <filename>.dta
(Press -break- when screen fills up)
. log close
3. Here is how you read the 114 and 115 formats:
Byte 1: A byte contains two hexadecimal (base 16) digits.
Thus, byte one contains two digits.
Those two digits will be 0x72 or 0x73. When I write 0x in
front of a number, I mean that the number is recorded in
hexadecimal. What the byte actually contains -- and what
the dump actually shows -- is "72" or "73".
FYI, 0x72 = 114 and 0x73 = 115. That's how Sergiy knew the
dataset format.
Byte 2: Contains 0x01 or 0x02, meaning HILO or LOHI byte
ordering, respectively. We are going to need the byte order
to interpret bytes 5-6 and 7-10 later. If the byte order is
HILO, we can read the numbers just as they are
written. If the byte order is LOHI, we will have to
reverse the order of pairs of digits. I will explain when
the problem arises.
Byte 3: Contains 0x01. It always contains this when the dataset
format is 114 or 115.
Byte 4: Contains 0x00. It always contains this when the dataset
format is 114 or 115.
Bytes 5-6: Contain a four-digit hexadecimal number. That
four-digit number says how many variables are in the
dataset.
Let's pretend our file contains 0x0a0b.
If the byte order (byte 2) is HILO, we can translate
directly from base 16 to base 10: We have hex number
a0b, we type -inten 16 a0b-, and learn the dataset
contains 2,571 variables.
If the byte order is LOHI, however, we must first reverse
the bytes. Remember, each byte contains 2 digits.
Thus (LOHI) 0x0a0b = (HILO) 0x0b0a. So we type -inten 16 b0a-
and learn the dataset contains 2,826 variables.
Bytes 7-10: Contain an eight-digit hexadecimal number
corresponding to the number of observations.
Let's pretend our dataset contains 0x0002fa03.
Just as before, we can read it from left to right if
the byte order is HILO. We type -inten 16 2fa03- and learn
we have 195,075 observations.
If numbers are stored in LOHI format, we must reverse
the bytes (pairs of digits): (LOHI) 0002fa03 = (HILO) 03fa0200.
We type -inten 16 3fa0200- and learn our dataset contains
66,716,160 observations.
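For anyone who would rather check the arithmetic outside Stata, here is
a minimal Python sketch of the same header decoding. It covers only the
first 10 bytes as described above; the function name read_dta_header and
the choice to treat both counts as unsigned are my assumptions, not
anything taken from StataCorp's code.

```python
import struct

# Byte 2 of the header selects the byte order, as described above.
BYTE_ORDERS = {0x01: ("HILO", ">"), 0x02: ("LOHI", "<")}

def read_dta_header(first_bytes):
    """Decode bytes 1-10 of a format-114/115 .dta file into a dict."""
    if len(first_bytes) < 10:
        raise ValueError("need at least the first 10 bytes of the file")
    release, order_code = first_bytes[0], first_bytes[1]
    if release not in (0x72, 0x73):
        raise ValueError("not format 114/115: byte 1 is 0x%02x" % release)
    if order_code not in BYTE_ORDERS:
        raise ValueError("unexpected byte-order code 0x%02x" % order_code)
    order_name, prefix = BYTE_ORDERS[order_code]
    # Bytes 5-6 hold the number of variables (2 bytes) and bytes 7-10 the
    # number of observations (4 bytes). Unsigned is an assumption here.
    nvar, nobs = struct.unpack(prefix + "HI", first_bytes[4:10])
    return {"format": release, "byteorder": order_name,
            "nvar": nvar, "nobs": nobs}

# The pretend values from the text, read both ways.
hilo = bytes.fromhex("72 01 01 00 0a 0b 00 02 fa 03")
lohi = bytes.fromhex("72 02 01 00 0a 0b 00 02 fa 03")
print(read_dta_header(hilo))  # nvar 2571, nobs 195075
print(read_dta_header(lohi))  # nvar 2826, nobs 66716160
```

To use it on a real file, pass it open(path, "rb").read(10).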
Okay, now we know the number of variables and number of observations the
dataset SHOULD contain.
Sergiy was able to read the dataset with a previous version of Stata.
How many observations does the old Stata report? It needs to match
or the dataset is corrupted.
Now, look at the last observation. Type,
. list in l
In theory, it makes no difference whether Sergiy does this with an OLD
Stata or Stata 13. If I were Sergiy, I'd do it both ways just for my
own peace of mind.
Anyway, look at the last observation. Look especially at the last few
variables. Do they look correct? If they look correct, they probably
are correct. Corrupt data usually looks corrupt because values will be
out of range. A person's age won't randomly change from 48 to a number
within the reasonable range for ages; it is more likely to randomly
change to a number outside of that range because there are so many more
numbers outside the range than inside it.
I'd probably trust the data if the last observation looked good.
More to do
After the data, the next and last things recorded in 114- and 115-format
datasets are the value labels.
If the file was shortened, it is likely that not all value labels that
should be defined are defined, and possibly the last value label does not
have all the labels defined that it should.
Here at StataCorp, we would do the following:
. set more off
. log using fulllog
. hexdump <originalfile>.dta
. log close
and we would look at the end of the log.
I am also wondering whether the file was not shortened, but
accidentally lengthened, say by a mailer adding linefeed or carriage
return and linefeed to the end of the file. Linefeed is 0x0a and
carriage return 0x0d.
Does the file end in 0x0d0a or in 0x0a?
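If reading the end of a long -hexdump- log is inconvenient, the last
few bytes can also be checked directly. Here is a minimal Python sketch
under that assumption. The names file_tail and classify_file_ending are
made up for illustration. Keep in mind that a file can legitimately end
in 0x0a, so a match is a hint worth investigating, not proof that a
mailer altered the file.

```python
def file_tail(path, n=16):
    """Return the last n bytes of the file (fewer if the file is shorter)."""
    with open(path, "rb") as f:
        f.seek(0, 2)                  # move to the end of the file
        size = f.tell()
        f.seek(max(0, size - n))      # back up n bytes, but not past the start
        return f.read()

def classify_file_ending(tail):
    """Report whether the bytes end in 0x0d0a, in 0x0a alone, or neither."""
    if tail.endswith(b"\x0d\x0a"):
        return "ends in 0x0d0a (carriage return + linefeed)"
    if tail.endswith(b"\x0a"):
        return "ends in 0x0a (linefeed)"
    return "ends in neither"

# Example with in-memory bytes; for a real file, use
# classify_file_ending(file_tail("datafile.dta")).
print(classify_file_ending(b"\x00\x01\x0d\x0a"))
```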
I hope this helps.
-- Bill
[email protected]