
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Error 612 on .dta in Stata 13.1


From   "William Gould, StataCorp LP" <[email protected]>
To   [email protected]
Subject   Re: st: Error 612 on .dta in Stata 13.1
Date   Mon, 09 Dec 2013 14:07:01 -0600

Sergiy Radyakin <[email protected]> reports having two old .dta
files that Stata 11 and 12 can -use- without problem, but that StataMP
13.1 refuses to read, instead saying

    . use "datafile.dta", clear
    .dta file corrupt
        The file unexpectedly ended before it should have.
    r(612);

Sergiy is looking for advice and cannot share the data files.

Sergiy used -hexdump- or something on the file and reports that they 
are specification 114, meaning they are from Stata 10.


Why can Stata 11 and 12 read the data, but not Stata 13?
--------------------------------------------------------

Stata 13 is far stricter than any previous version of Stata about
.dta files matching the expected format.  We changed the code and we
changed the file format so that Stata could better determine when a
problem arose.

These are old files and so Stata 13 is more limited on the kinds of
problems it can detect, but the code is still being more demanding.

That is why Stata 13 cannot read the files but Stata 11 and 12 can.


An assumption I am making
-------------------------

Sergiy can read the data using a previous version of Stata, he says.  I
am assuming that, using the OLD Stata, if Sergiy types

        . use <originaldataset>

        . save copy

and then if Sergiy switches to Stata 13 and types 

        . use copy

the dataset loads without error.  If that is not true, then either
there is a bug in Stata 13 or the original dataset is corrupt, and
just reading the corrupted dataset corrupted the OLD Stata session.

At that point, Sergiy needs to talk to us, because we will want to 
determine which is the case.  We can sign nondisclosure forms.


How to determine how serious the error is
-----------------------------------------

Let's assume that using and saving the original data with the OLD Stata
results in a dataset Stata 13 can read.

Let me outline the process we would follow if Sergiy could send us the 
dataset:

    1.  In Stata 13, type -help dta-.  Click on "114".
        Unfortunately, when I did that, I discovered a minor error in 
        our help file.  Further down, the file talks about "115"
        datasets even though I had clicked on 114.

        Do not panic.  Stata 114 and 115 formats are identical.  They
        differ only in that Stata 115 might contain %tb formats for 
        date variables, whereas Stata 114 datasets cannot.

    2.  First, I want Sergiy to use -hexdump- to obtain the header.
        In Stata 13, type

                . set more on
                . log using <whatever>
                . hexdump <filename>.dta
                  (Press -break- when screen fills up)
                . log close

    3.  Here is how you read the 114 and 115 formats:

        Byte 1:  A byte contains two hexadecimal (base 16) digits.
            Thus, byte one contains two digits.

            Those two digits will be 0x72 or 0x73.  When I write 0x in
            front of a number, I mean that the number is recorded in
            hexadecimal.  What the byte actually contains -- and what 
            the dump actually shows -- is "72" or "73".

            FYI, 0x72 = 114 and 0x73 = 115.  That's how Sergiy knew the
            dataset format.

        Byte 2:  Contains 0x01 or 0x02, meaning HILO or LOHI byte
            ordering, respectively.  We are going to need the byte order
            to interpret bytes 5-6 and 7-10 later.  If the byte order is
            HILO, we can read the numbers just as they are
            written.  If the byte order is LOHI, we will have to
            reverse the order of pairs of digits.  I will explain when
            the problem arises.

        Byte 3:  Contains 0x01.  It always contains this when the dataset 
            format is 114 or 115.

        Byte 4:  Contains 0x00.  It always contains this when the dataset 
            format is 114 or 115. 

        Bytes 5-6:  Contain a four-digit hexadecimal number.  That
             four-digit number says how many variables are in the
             dataset.

             Let's pretend our file contains 0x0a0b.  

             If the byte order (byte 2) is HILO, we can translate
             directly from base 16 to base 10:  We have hex number
             a0b, we type -inten 16 a0b-, and learn the dataset
             contains 2,571 variables.

             If the byte order is LOHI, however, we must first reverse
             the bytes.  Remember, each byte contains 2 digits.  Thus,
             (LOHI) 0x0a0b = (HILO) 0x0b0a.  So we type -inten 16 b0a-
             and learn the dataset contains 2,826 variables.

        Bytes 7-10:  Contain an eight-digit hexadecimal number
             corresponding to the number of observations.

             Let's pretend our dataset contains 0x0002fa03.

             Just as before, we can read it from left to right if
             the byte order is HILO.  We type -inten 16 2fa03- and learn
             we have 195,075 observations.

             If numbers are stored in LOHI format, we must reverse
             the order of the bytes (pairs of digits):
             (LOHI) 0x0002fa03 = (HILO) 0x03fa0200.
             We type -inten 16 3fa0200- and learn our dataset contains
             66,716,160 observations.
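
For anyone who would rather not do the hex arithmetic by hand, here is
a minimal sketch in Python (not Stata) of the byte-by-byte reading
described above.  The helper name read_dta_header_start is made up for
illustration, and the sketch assumes the two counts can be read as
unsigned integers; it looks only at the first 10 bytes and does not
validate anything else in the file.

```python
import struct

# Byte 2 of the file selects the byte order for the multi-byte counts.
BYTE_ORDERS = {0x01: ("HILO", ">"), 0x02: ("LOHI", "<")}


def read_dta_header_start(first10: bytes) -> dict:
    """Decode bytes 1-10 of a format-114/115 .dta file (hypothetical helper)."""
    if len(first10) < 10:
        raise ValueError("need at least the first 10 bytes of the file")
    release, order_code = first10[0], first10[1]
    if release not in (0x72, 0x73):
        raise ValueError(f"byte 1 is 0x{release:02x}, not 0x72 (114) or 0x73 (115)")
    if order_code not in BYTE_ORDERS:
        raise ValueError(f"byte 2 is 0x{order_code:02x}, not 0x01 or 0x02")
    order_name, prefix = BYTE_ORDERS[order_code]
    # Bytes 5-6 hold the variable count, bytes 7-10 the observation count.
    # Assumption of this sketch: both are read as unsigned integers.
    nvar, nobs = struct.unpack(prefix + "HI", first10[4:10])
    return {
        "format": release,
        "byteorder": order_name,
        "byte3": first10[2],  # expected to be 0x01
        "byte4": first10[3],  # expected to be 0x00
        "nvar": nvar,
        "nobs": nobs,
    }


if __name__ == "__main__":
    # The pretend values from the text, stored in a LOHI file.
    pretend = bytes([0x72, 0x02, 0x01, 0x00, 0x0A, 0x0B, 0x00, 0x02, 0xFA, 0x03])
    print(read_dta_header_start(pretend))
```

With a real file, one would pass the first 10 bytes read in binary
mode, for example open(path, "rb").read(10).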

Okay, now we know the number of variables and number of observations the
dataset SHOULD contain.

Sergiy was able to read the dataset with a previous version of Stata. 

How many observations does the old Stata report?  It needs to match 
or the dataset is corrupted. 

Now, look at the last observation.  Type,

        . list in l

In theory, it makes no difference whether Sergiy does this with an OLD 
Stata or Stata 13.  If I were Sergiy, I'd do it both ways just for my 
own peace of mind.

Anyway, look at the last observation.  Look especially at the end
variables.  Do they look correct?  If they look correct, they probably
are correct.  Corrupt data usually looks corrupt because values will be
out of range.  A person's age won't randomly change from 48 to a number
within the reasonable range for ages; it is more likely to randomly
change to a number outside of that range because there are so many more 
of them.

I'd probably trust the data if the last observation looked good.


More to do
----------

After the data, the next and last things recorded in the 114 and 115 format
datasets are the value labels.

If the file was shortened, it is likely that not all value labels that 
should be defined are defined, and possibly the last value label does not 
have all the labels defined that it should.  

Here at StataCorp, we would do the following: 

        . set more off
        . log using fulllog
        . hexdump <originalfile>.dta
        . log close

and we would look at the end of the log.

I am also wondering whether the file was not shortened, but
accidentally lengthened, say by a mailer adding linefeed or carriage
return and linefeed to the end of the file.  Linefeed is 0x0a and
carriage return 0x0d.

Does the file end in 0x0d0a or in 0x0a?  
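
If scrolling to the end of a long hexdump log is inconvenient, the
same question can be asked outside Stata.  What follows is a rough
sketch in Python under stated assumptions: the names read_tail and
describe_file_ending are invented for illustration, and a file that
happens to end in 0x0a is only a hint, not proof, that something was
appended, because 0x0a can also be a legitimate last byte of the
stored contents.

```python
import os


def read_tail(path: str, n: int = 16) -> bytes:
    """Return the last n bytes of the file at path (hypothetical helper)."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(size - n, 0))
        return f.read()


def describe_file_ending(tail: bytes) -> str:
    """Classify the final bytes: carriage return + linefeed, linefeed, or other."""
    if tail.endswith(b"\x0d\x0a"):
        return "ends in 0x0d0a (carriage return and linefeed)"
    if tail.endswith(b"\x0a"):
        return "ends in 0x0a (linefeed)"
    return "does not end in a linefeed"
```

For example, describe_file_ending(read_tail("datafile.dta")) reports
which of the two cases, if either, applies to the file from the
original error message.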

I hope this helps. 

-- Bill
[email protected]