From
Sergiy Radyakin <serjradyakin@gmail.com>

To
"statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>

Subject
Re: st: Error 612 on .dta in Stata 13.1

Date
Mon, 9 Dec 2013 19:09:18 -0500

Dear James and Bill,

thank you very much for your advice!

The problem appears to be a combination of all of the following:
1) the data file being truncated,
2) the data file being corrupt within the remaining length, and
3) the tolerance of earlier Statas for truncated files, and their
non-transparent handling of such corrupt files.

I've put a more verbose report here:
http://www.radyakin.org/statalist/statabugs/incomplete_f.htm

Ironically, I requested this behavior back in 2011. But it didn't
occur to me until a couple of hours after I posted the email.

James, I am afraid installing Python on the server would not be an
immediate possibility for me, but I will one day try it with your
command. If you are unsure of how it is going to react, perhaps try it
with the replication script from here:
http://www.radyakin.org/statalist/statabugs/incomplete_file.htm

Best, Sergiy Radyakin

On Mon, Dec 9, 2013 at 3:07 PM, William Gould, StataCorp LP
<wgould@stata.com> wrote:
> Sergiy Radyakin <serjradyakin@gmail.com> reports having two old .dta
> files that Stata 11 and 12 can -use- without problem, but that StataMP
> 13.1 refuses to read, instead saying
>
>     . use "datafile.dta", clear
>     .dta file corrupt
>         The file unexpectedly ended before it should have.
>     r(612);
>
> Sergiy is looking for advice and cannot share the data files.
>
> Sergiy used -hexdump- or something on the file and reports that they
> are specification 114, meaning they are from Stata 10.
>
>
> Why can Stata 11 and 12 read the data, but not Stata 13?
> --------------------------------------------------------
>
> Stata 13 is far more demanding that .dta files match the expected
> format than any previous version of Stata. We changed the code and we
> changed the file format so that Stata could better determine when a
> problem arose.
>
> These are old files and so Stata 13 is more limited on the kinds of
> problems it can detect, but the code is still being more demanding.
>
> That is why Stata 13 cannot read the files but Stata 11 and 12 can.
>
>
> An assumption I am making
> -------------------------
>
> Sergiy can read the data using a previous version of Stata, he says. I
> am assuming that, using the OLD Stata, if Sergiy types
>
>     . use <originaldataset>
>
>     . save copy
>
> and then if Sergiy switches to Stata 13 and types
>
>     . use copy
>
> the dataset loads without error. If that is not true, then either
> there is a bug in Stata 13 or the original dataset is corrupt, and
> just reading the corrupted dataset corrupted the OLD Stata session.
>
> At that point, Sergiy needs to talk to us, because we will want to
> determine which is the case. We can sign nondisclosure forms.
>
>
> How to determine how serious the error is
> -----------------------------------------
>
> Let's assume that using and saving the original data with the OLD Stata
> results in a dataset Stata 13 can read.
>
> Let me outline the process we would follow if Sergiy could send us the
> dataset:
>
>     1. In Stata 13, type -help dta-. Click on "114".
>        Unfortunately, when I did that, I discovered a minor error in
>        our help file. Further down, the file talks about "115"
>        datasets even though I had clicked on 114.
>
>        Do not panic. Stata 114 and 115 formats are identical. They
>        differ only in that Stata 115 might contain %tb formats for
>        date variables, whereas Stata 114 datasets cannot.
>
>     2. First, I want Sergiy to use -hexdump- to obtain the header.
>        In Stata 13, type
>
>            . set more on
>            . log using <whatever>
>            . hexdump <filename>.dta
>              (Press -break- when screen fills up)
>            . log close
>
>     3. Here is how you read the 114 and 115 formats:
>
>        Byte 1: A byte contains two hexadecimal (base 16) digits.
>           Thus, byte one contains two digits.
>
>           Those two digits will be 0x72 or 0x73. When I write 0x in
>           front of a number, I mean that the number is recorded in
>           hexadecimal.
>           What the byte actually contains -- and what
>           the dump actually shows -- is "72" or "73".
>
>           FYI, 0x72 = 114 and 0x73 = 115. That's how Sergiy knew the
>           dataset format.
>
>        Byte 2: Contains 0x01 or 0x02, meaning HILO or LOHI byte
>           ordering, respectively. We are going to need the byte order
>           to interpret bytes 5-6 and 7-10 later. If the byte order is
>           HILO, we can read the numbers just as they are
>           written. If the byte order is LOHI, we will have to
>           reverse the order of pairs of digits. I will explain when
>           the problem arises.
>
>        Byte 3: Contains 0x01. It always contains this when the dataset
>           format is 114 or 115.
>
>        Byte 4: Contains 0x00. It always contains this when the dataset
>           format is 114 or 115.
>
>        Bytes 5-6: contain a four-digit hexadecimal number. That
>           four-digit number says how many variables are in the
>           dataset.
>
>           Let's pretend our file contains 0x0a0b.
>
>           If the byte order (byte 2) is HILO, we can translate
>           directly from base 16 to base 10: We have hex number
>           a0b, we type -inten 16 a0b-, and learn the dataset
>           contains 2,571 variables.
>
>           If the byte order is LOHI, however, we must first reverse
>           the bytes. Remember, each byte contains 2 digits.
>           Thus (LOHI) 0x0a0b = (HILO) 0x0b0a. So we type -inten 16 b0a-
>           and learn the dataset contains 2,826 variables.
>
>        Bytes 7-10: contain an eight-digit hexadecimal number
>           corresponding to the number of observations.
>
>           Let's pretend our dataset contains 0x0002fa03.
>
>           Just as before, we can read it from left-to-right if
>           the byte order is HILO. We type -inten 16 2fa03- and learn
>           we have 195,075 observations.
>
>           If numbers are stored in LOHI format, we must reverse
>           the bytes; (LOHI) 0002fa03 = (HILO) 03fa0200.
>           We type -inten 16 3fa0200- and learn our dataset contains
>           66,716,160 observations.
>
> Okay, now we know the number of variables and number of observations the
> dataset SHOULD contain.
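
For anyone who would rather not convert the hex digits by hand, the header reading described above can be sketched in Python. This is only an illustration under stated assumptions, not StataCorp's code: the function name read_dta_header and the returned dictionary are hypothetical, and the two counts are read as unsigned integers, which the message does not specify.

```python
import struct

# Byte 2 of the header, as described above: 0x01 = HILO, 0x02 = LOHI.
BYTE_ORDERS = {0x01: ">", 0x02: "<"}


def read_dta_header(first_bytes):
    """Decode the first 10 bytes of a format-114/115 .dta file.

    Hypothetical helper; the name and return shape are illustrative.
    """
    if len(first_bytes) < 10:
        raise ValueError("need at least the first 10 bytes of the file")
    release, byteorder = first_bytes[0], first_bytes[1]
    if release not in (0x72, 0x73):
        raise ValueError("not a format-114/115 file: byte 1 is 0x%02x" % release)
    if byteorder not in BYTE_ORDERS:
        raise ValueError("unexpected byte-order code 0x%02x" % byteorder)
    # Bytes 5-6 hold the variable count, bytes 7-10 the observation count.
    # Assumption: both are read as unsigned integers here.
    nvar, nobs = struct.unpack(BYTE_ORDERS[byteorder] + "HI", first_bytes[4:10])
    return {"format": release, "nvar": nvar, "nobs": nobs}


# The pretend header from the text, first as HILO and then as LOHI.
hilo = bytes([0x72, 0x01, 0x01, 0x00, 0x0a, 0x0b, 0x00, 0x02, 0xfa, 0x03])
lohi = bytes([0x72, 0x02, 0x01, 0x00, 0x0a, 0x0b, 0x00, 0x02, 0xfa, 0x03])
print(read_dta_header(hilo))
print(read_dta_header(lohi))
```

On a real file one would pass the result of open(path, "rb").read(10). The two pretend headers differ only in byte 2, which is enough to change both counts.
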
>
> Sergiy was able to read the dataset with a previous version of Stata.
>
> How many observations does the old Stata report? It needs to match
> or the dataset is corrupted.
>
> Now, look at the last observation. Type,
>
>     . list in l
>
> In theory, it makes no difference whether Sergiy does this with an OLD
> Stata or Stata 13. If I were Sergiy, I'd do it both ways just for my
> own peace of mind.
>
> Anyway, look at the last observation. Look especially at the end
> variables. Do they look correct? If they look correct, they probably
> are correct. Corrupt data usually looks corrupt because values will be
> out of range. A person's age won't randomly change from 48 to a number
> within the reasonable range for ages; it is more likely to randomly
> change to a number outside of that range because there are so many more
> of them.
>
> I'd probably trust the data if the last observation looked good.
>
>
> More to do
> ----------
>
> After the data, the next and last thing recorded in the 114 and 115 format
> datasets are the value labels.
>
> If the file was shortened, it is likely that not all value labels that
> should be defined are defined, and possibly the last value label does not
> have all the labels defined that it should.
>
> Here at StataCorp, we would do the following:
>
>     . set more off
>     . log using fulllog
>     . hexdump <originalfile>.dta
>     . log close
>
> and we would look at the end of the log.
>
> I am also wondering whether the file was not shortened, but
> accidentally lengthened, say by a mailer adding linefeed or carriage
> return and linefeed to the end of the file. Linefeed is 0x0a and
> carriage return 0x0d.

> Does the file end in 0x0d0a or in 0x0a?
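
The question about the file ending can also be checked without reading a full hexdump log. The following is a minimal Python sketch; the helper names tail_bytes and describe_ending are hypothetical. A trailing 0x0a can also be legitimate data, so a match is only a hint that something was appended, not proof.

```python
def tail_bytes(path, n=2):
    """Return the last n bytes of the file at path (fewer if it is shorter)."""
    with open(path, "rb") as f:
        size = f.seek(0, 2)       # 2 = seek relative to the end of the file
        f.seek(max(size - n, 0))
        return f.read()


def describe_ending(tail):
    """Classify the final bytes: carriage return + linefeed, linefeed, or neither."""
    if tail.endswith(b"\x0d\x0a"):
        return "CRLF"
    if tail.endswith(b"\x0a"):
        return "LF"
    return "neither"


print(describe_ending(b"\x00\x0d\x0a"))
print(describe_ending(b"\x00\x0a"))
print(describe_ending(b"\x00\x00"))
```

For an actual dataset the call would be describe_ending(tail_bytes("datafile.dta")), with the file name taken from the example above.
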

> I hope this helps.
>
> -- Bill
> wgould@stata.com

References
Re: st: Error 612 on .dta in Stata 13.1
From: "William Gould, StataCorp LP" <wgould@stata.com>

