
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Error 612 on .dta in Stata 13.1


From   Sergiy Radyakin <serjradyakin@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: Error 612 on .dta in Stata 13.1
Date   Mon, 9 Dec 2013 19:09:18 -0500

Dear James and Bill,
thank you very much for your advice! The problem appears to be a
combination of all of the following:

1) data file being truncated, and
2) data file being corrupt within the remaining length, and
3) earlier versions of Stata tolerating truncated files and handling
such corrupt files non-transparently.

I've put a more verbose report here:
http://www.radyakin.org/statalist/statabugs/incomplete_f.htm
Ironically, I requested this behavior myself back in 2011, but that did
not occur to me until a couple of hours after I posted the email.

James, I am afraid installing Python on the server would not be an
immediate possibility for me, but I will one day try it with your
command. If you are unsure of how it is going to react, perhaps try it
with the replication script from here:
http://www.radyakin.org/statalist/statabugs/incomplete_file.htm

Best, Sergiy Radyakin



On Mon, Dec 9, 2013 at 3:07 PM, William Gould, StataCorp LP
<wgould@stata.com> wrote:
> Sergiy Radyakin <serjradyakin@gmail.com> reports having two old .dta
> files that Stata 11 and 12 can -use- without problem, but that StataMP
> 13.1 refuses to read, instead saying
>
>     . use "datafile.dta", clear
>     .dta file corrupt
>         The file unexpectedly ended before it should have.
>     r(612);
>
> Sergiy is looking for advice and cannot share the data files.
>
> Sergiy used -hexdump- or something on the file and reports that they
> are specification 114, meaning they are from Stata 10.
>
>
> Why can Stata 11 and 12 read the data, but not Stata 13?
> --------------------------------------------------------
>
> Stata 13 is far more demanding that .dta files match the expected
> format than any previous version of Stata.  We changed the code and we
> changed the file format so that Stata could better determine when a
> problem arose.
>
> These are old files and so Stata 13 is more limited on the kinds of
> problems it can detect, but the code is still being more demanding.
>
> That is why Stata 13 cannot read the files but Stata 11 and 12 can.
>
>
> An assumption I am making
> -------------------------
>
> Sergiy can read the data using a previous version of Stata, he says.  I
> am assuming that, using the OLD Stata, if Sergiy types
>
>         . use <originaldataset>
>
>         . save copy
>
> and then if Sergiy switches to Stata 13 and types
>
>         . use copy
>
> the dataset loads without error.  If that is not true, then either
> there is a bug in Stata 13 or the original dataset is corrupt, and
> just reading the corrupted dataset corrupted the OLD Stata session.
>
> At that point, Sergiy needs to talk to us, because we will want to
> determine which is the case.  We can sign nondisclosure forms.
>
>
> How to determine how serious the error is
> -----------------------------------------
>
> Let's assume that using and saving the original data with the OLD Stata
> results in a dataset Stata 13 can read.
>
> Let me outline the process we would follow if Sergiy could send us the
> dataset:
>
>     1.  In Stata 13, type -help dta-.  Click on "114".
>         Unfortunately, when I did that, I discovered a minor error in
>         our help file.  Further down, the file talks about "115"
>         datasets even though I had clicked on 114.
>
>         Do not panic.  Stata 114 and 115 formats are identical.  They
>         differ only in that Stata 115 might contain %tb formats for
>         date variables, whereas Stata 114 datasets cannot.
>
>     2.  First, I want Sergiy to use -hexdump- to obtain the header.
>         In Stata 13, type
>
>                 . set more on
>                 . log using <whatever>
>                 . hexdump <filename>.dta
>                   (Press -break- when screen fills up)
>                 . log close
>
>     3.  Here is how you read the 114 and 115 formats:
>
>         Byte 1:  A byte contains two hexadecimal (base 16) digits.
>             Thus, byte one contains two digits.
>
>             Those two digits will be 0x72 or 0x73.  When I write 0x in
>             front of a number, I mean that the number is recorded in
>             hexadecimal.  What the byte actually contains -- and what
>             the dump actually shows -- is "72" or "73".
>
>             FYI, 0x72 = 114 and 0x73 = 115.  That's how Sergiy knew the
>             dataset format.
>
>         Byte 2:  Contains 0x01 or 0x02, meaning HILO or LOHI byte
>             ordering, respectively.  We are going to need the byte order
>             to interpret bytes 5-6 and 7-10 later.  If the byte order is
>             HILO, we can read the numbers just as they are
>             written.  If the byte order is LOHI, we will have to
>             reverse the order of pairs of digits.  I will explain when
>             the problem arises.
>
>         Byte 3:  Contains 0x01.  It always contains this when the dataset
>             format is 114 or 115.
>
>         Byte 4:  Contains 0x00.  It always contains this when the dataset
>             format is 114 or 115.
>
>         Bytes 5-6:  contains a four-digit hexadecimal number.  That
>              four-digit number says how many variables are in the
>              dataset.
>
>              Let's pretend our file contains 0x0a0b.
>
>              If the byte order (byte 2) is HILO, we can translate
>              directly from base 16 to base 10:  We have hex number
>              a0b, we type -inten 16 a0b-, and learn the dataset
>              contains 2,571 variables.
>
>              If the byte order is LOHI, however, we must first reverse
>              the bytes.  Remember, each byte contains 2 digits.
>              Thus (LOHI) 0x0a0b = (HILO) 0x0b0a.  So we type -inten 16 b0a-
>              and learn the dataset contains 2,826 variables.
>
>         Bytes 7-10:  contains an eight-digit hexadecimal number
>              corresponding to the number of observations.
>
>              Let's pretend our dataset contains 0x0002fa03.
>
>              Just as before, we can read it from left to right if
>              the byteorder is HILO.  We type -inten 16 2fa03- and learn
>              we have 195,075 observations.
>
>              If numbers are stored in LOHI format, we must reverse
>              the bytes (pairs of digits): (LOHI) 0002fa03 = (HILO) 03fa0200.
>              We type -inten 16 3fa0200- and learn our dataset contains
>              66,716,160 observations.
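
A minimal Python sketch of the header layout described above, for anyone who would rather not reverse the bytes by hand. It is an illustration, not StataCorp's code. The function name read_dta114_header is hypothetical, and it assumes both counts are unsigned, matching the description of them as plain hexadecimal numbers.

```python
import struct


def read_dta114_header(first_bytes: bytes) -> dict:
    """Decode bytes 1-10 of a format-114/115 .dta file (hypothetical helper)."""
    if len(first_bytes) < 10:
        raise ValueError("need at least the first 10 bytes of the file")

    # Bytes 1-4: format number, byte order, and two fixed bytes.
    fmt, byteorder, filetype, unused = first_bytes[:4]

    if byteorder == 0x01:
        prefix = ">"  # HILO: digits can be read as written
    elif byteorder == 0x02:
        prefix = "<"  # LOHI: bytes are stored in reverse order
    else:
        raise ValueError(f"unexpected byte-order byte: {byteorder:#04x}")

    # Bytes 5-6 hold the number of variables (2 bytes); bytes 7-10 hold the
    # number of observations (4 bytes). Assumption: both are unsigned.
    nvar, nobs = struct.unpack(prefix + "HI", first_bytes[4:10])

    return {
        "format": fmt,          # 0x72 = 114, 0x73 = 115
        "byteorder": "HILO" if byteorder == 0x01 else "LOHI",
        "filetype": filetype,   # expected to be 0x01
        "unused": unused,       # expected to be 0x00
        "nvar": nvar,
        "nobs": nobs,
    }
```

To apply it to a real file, open the file in binary mode and pass the first ten bytes, for example read_dta114_header(open(path, "rb").read(10)), where path is whatever file is being inspected.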
>
> Okay, now we know the number of variables and number of observations the
> dataset SHOULD contain.
>
> Sergiy was able to read the dataset with a previous version of Stata.
>
> How many observations does the old Stata report?  It needs to match
> or the dataset is corrupted.
>
> Now, look at the last observation.  Type,
>
>         . list in l
>
> In theory, it makes no difference whether Sergiy does this with an OLD
> Stata or Stata 13.  If I were Sergiy, I'd do it both ways just for my
> own peace of mind.
>
> Anyway, look at the last observation.  Look especially at the end
> variables.  Do they look correct?  If they look correct, they probably
> are correct.  Corrupt data usually looks corrupt because values will be
> out of range.  A person's age won't randomly change from 48 to a number
> within the reasonable range for ages; it is more likely to randomly
> change to a number outside of that range because there are so many more
> of them.
>
> I'd probably trust the data if the last observation looked good.
>
>
> More to do
> ----------
>
> After the data, the next and last thing recorded in the 114 and 115 format
> datasets are the value labels.
>
> If the file was shortened, it is likely that not all value labels that
> should be defined are defined, and possibly the last value label does not
> have all the labels defined that it should.
>
> Here at StataCorp, we would do the following:
>
>         . set more off
>         . log using fulllog
>         . hexdump <originalfile>.dta
>         . log close
>
> and we would look at the end of the log.
>
> I am also wondering whether the file was not shortened, but
> accidentally lengthened, say by a mailer adding linefeed or carriage
> return and linefeed to the end of the file.  Linefeed is 0x0a and
> carriage return 0x0d.
>
> Does the file end in 0x0d0a or in 0x0a?
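
The end-of-file question above can also be answered without reading a full hexdump log. The following is a rough Python sketch; the function name trailing_newline is hypothetical. A final 0x0a can also be legitimate data, so a match is only a hint that something was appended, not proof.

```python
def trailing_newline(tail: bytes) -> str:
    """Classify the last bytes of a file: 'CRLF' (0x0d0a), 'LF' (0x0a), or 'none'."""
    # Check the two-byte ending first, since CRLF also ends in 0x0a.
    if tail.endswith(b"\r\n"):
        return "CRLF"
    if tail.endswith(b"\n"):
        return "LF"
    return "none"
```

Only the last two bytes of the file are needed, so it is enough to read the file in binary mode and pass its final two bytes to the function.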
>
> I hope this helps.
>
> -- Bill
> wgould@stata.com
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

