
Re: st: binary format type str question

From	Mark Fisher <[email protected]>
To	[email protected]
Subject	Re: st: binary format type str question
Date	Wed, 14 Mar 2007 07:45:11 -0500

Thanks so much for your help. I have no problems reading the test files. Also, I now realize the first byte in the files I am having trouble with is not 113 but 110. Is there a document that explains this format (and other non-113 Stata formats) or should I just give up?

--Mark.

William Gould, Stata wrote:

Mark Fisher <[email protected]> has more questions about reading, with an eye to translating, .dta files.

I've learned a bit more about the structure of the file in question.
I read the file (correctly, I think) right up to the point where the data start. Then, in order to do some deconstruction, I simply read *all* the remaining bytes in the file; there are only 1071 of them. Since there are 6 variables (with types 98, 136, 102, 105, 102, and 98) and 51 observations, I don't see how I can possibly account for all of them since this only allows for 21 bytes per observation.

Something is not adding up. Later in his post, Mark asks, "Is it possible
this dta file was created in a nonstandard way?"
The answer is conditionally no, the condition being that the first byte in the
file is 0x71. That is an important condition. In earlier file formats, types
were coded differently. For instance, if the first byte is 0x70, then the
file is from Stata 8.0, and the format was a little different. If the first
byte is 0x6f, then the file is from Stata/SE 7.0, and is different yet again.
Historically, the number has ranged from 0x66 up to the current 0x71.
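
The check on the first byte can be sketched in C as follows. This is a minimal sketch, not code from Stata: the function names is_format_0x71() and read_first_byte() are made up for illustration, and the only fact taken from the discussion is that the layout described in this post applies when the first byte is 0x71 (113 decimal). Mark's value of 110 is 0x6e, so it fails the check.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the first byte is 0x71 (113),
   the only value for which the layout in this post is stated to hold. */
int is_format_0x71(int first_byte)
{
    return first_byte == 0x71;
}

/* Hypothetical helper: read the first byte of a file.
   Returns the byte (0..255), or -1 if the file cannot be
   opened or is empty. */
int read_first_byte(const char *path)
{
    FILE *fp = fopen(path, "rb");   /* binary mode matters on some systems */
    int c;

    if (fp == NULL)
        return -1;
    c = fgetc(fp);
    fclose(fp);
    return (c == EOF) ? -1 : c;
}
```

A caller would refuse to go further, or switch to a different decoder, whenever is_format_0x71(read_first_byte(path)) is 0.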

Mark also asks, "Are there other dta files available on the web that I can
experiment with?"

Point your browser to http://www.stata-press.com/data/r9/

Datasets that are used in the various Stata manuals are there.

Anyway, here is how things are supposed to work:

The typlist Mark reported as

         type
---------------
var. 1     98
var. 2    136
var. 3    102
var. 4    105
var. 5    102
var. 6     98
---------------

From that, I can build the following table:

         type   meaning   length   offset
---------------------------------------------
var. 1     98   str98         98        0
var. 2    136   str136       136       98
var. 3    102   str102       102      234
var. 4    105   str105       105      336
var. 5    102   str102       102      441
var. 6     98   str98         98      543
---------------------------------------------
sum                          641
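
The offset column is just a running sum of the lengths, and the final sum is the observation width. A minimal sketch in C, where build_offsets() is a made-up name and the lengths are assumed to be already known from the typlist:

```c
#include <stddef.h>

/* Hypothetical helper: fill offset[] with the running sum of length[]
   and return the total, i.e. the width of one observation (lrecl). */
size_t build_offsets(const size_t *length, size_t nvar, size_t *offset)
{
    size_t pos = 0;
    size_t i;

    for (i = 0; i < nvar; i++) {
        offset[i] = pos;      /* variable i starts where the previous one ended */
        pos += length[i];
    }
    return pos;
}
```

Called with the lengths 98, 136, 102, 105, 102, 98, it yields the offsets 0, 98, 234, 336, 441, 543 and returns 641.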

The width of an observation (a.k.a. lrecl) is 641 bytes. The approved method is to read the data an observation at a time, in 641-byte chunks.
I will now use C jargon. Let unsigned char buf[641] contain one observation.
You can then extract each variable using memcpy(), using the offsets and
lengths from the table above. Once extracted, if numeric, and if bytes need
reordering, reorder them. If string, add a binary 0 terminator in case one is
missing.

This should be easy to code, but you will have to build a table in your code to direct what needs to be done.
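
One way the table-driven extraction might look in C is sketched below. The struct field type, the fields[] table, and the functions extract_str() and reverse_bytes() are all hypothetical names introduced for illustration; the offsets and lengths are the ones from the table above, and all six variables are assumed to be strings as in that table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: where a variable sits inside one observation. */
struct field {
    size_t offset;
    size_t length;
};

/* Offsets and lengths copied from the table above (lrecl = 641). */
const struct field fields[6] = {
    {0, 98}, {98, 136}, {234, 102}, {336, 105}, {441, 102}, {543, 98}
};

/* Copy one string variable out of the observation buffer and add a
   binary 0 terminator in case one is missing.
   dst must have room for f->length + 1 bytes. */
void extract_str(const unsigned char *buf, const struct field *f, char *dst)
{
    memcpy(dst, buf + f->offset, f->length);
    dst[f->length] = '\0';
}

/* Reverse n bytes in place, for numeric values whose byte order
   differs from that of the machine reading the file. */
void reverse_bytes(unsigned char *p, size_t n)
{
    size_t i;

    for (i = 0; i < n / 2; i++) {
        unsigned char tmp = p[i];
        p[i] = p[n - 1 - i];
        p[n - 1 - i] = tmp;
    }
}
```

With buf filled by one fread() of 641 bytes, a loop over fields[] calling extract_str() pulls out every variable of that observation. A file with numeric variables would need a type code in each table entry to decide between extract_str() and a memcpy() followed by reverse_bytes().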

-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
