Statalist The Stata Listserver



Re: st: binary format type str question


From   Mark Fisher <mark@markfisher.net>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: binary format type str question
Date   Tue, 13 Mar 2007 13:12:43 -0500

Wow, thanks so much for your help. Let me say first that I don't have access to Stata, so I can't do a -hexdump-. For reference, I'm using

http://www.stata.com/help.cgi?dta

Pretty much everything in this document (plus everything in your email) makes sense to me. But I can't make the mapping between the typelist that I get and what's in the data part of the file.

I've learned a bit more about the structure of the file in question.
I read the file (correctly, I think) right up to the point where the data start. Then, in order to do some deconstruction, I simply read *all* the remaining bytes in the file; there are only 1071 of them. Since there are 6 variables (with types 98, 136, 102, 105, 102, and 98) and 51 observations, I don't see how I can possibly account for all of them, since this only allows for 21 bytes per observation.

But a clear pattern emerges if I partition the list of bytes into a matrix of 51 rows and 21 columns. The first column contains byte values running consecutively from 1 to 51 --- apparently an index encoded as the byte value itself. (How do I make a correspondence between type 98 and this variable?) The next two columns contain two characters: state abbreviations (such as AL, AK, AZ, ...). (Again, how do I make a correspondence between type 136 and this variable?) The next 7 columns (that is, columns 3 to 9) are identical row by row: {0, 1, 12, 0, 0, 0, 64}. None of the remaining columns has identical rows. (Some of the remaining columns have zeros in them.)
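
The partitioning step can be sketched in Python (an assumption made for illustration; the program under discussion is written in Mathematica). The helper name `partition_rows` is hypothetical, and the zero bytes below are placeholders, not the contents of the real file; only the counts 1071, 51, and 21 come from the message.

```python
def partition_rows(data: bytes, n_rows: int) -> list[bytes]:
    """Split data into n_rows equal-width rows; fail if it does not divide evenly."""
    width, leftover = divmod(len(data), n_rows)
    if leftover:
        raise ValueError(f"{len(data)} bytes do not split evenly into {n_rows} rows")
    return [data[i * width:(i + 1) * width] for i in range(n_rows)]


# Placeholder content only: 1071 zero bytes stand in for the real data section.
rows = partition_rows(bytes(1071), 51)
print(len(rows), len(rows[0]))
```

If the record width implied by the typelist differs from the width found this way, the type codes are probably being interpreted incorrectly.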

Anyway, that's where I stand. Is it possible this dta file was created in a nonstandard way? (All the dta files I have are from Andrew Gelman's web site for his new "Data Analysis" book. The one I can actually read says "Written by R." in the data_label.) Are there other dta files available on the web that I can experiment with?

--Mark.



William Gould, Stata wrote:

Mark Fisher <mark@markfisher.net> writes,

I'm writing a Mathematica program to read stata "dta" files. [...]  I have
Everything seems to work fine [...]  But I can't figure out how to properly
read the data when the data types are in the range 1 to 244 (str1, str2, ...
str244).  [...]
David Kantor <kantor.d@att.net> speculated "that the string types are stored
such that... they have a 0-byte terminator if they are shorter than the
maximal length of the type; they have no terminator otherwise".

That would have been my guess as to Mark's problem, too, but Mark says No.

I want to suggest Mark become familiar with Stata's -hexdump- command.

Here's an example I just did:

============================================================================
. describe

Contains data from example.dta
  obs:             2
 vars:             4                          13 Mar 2007 08:37
 size:            22 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
a               byte   %8.0g
b               str2   %9s
c               str3   %9s
d               byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:

. list

     +----------------+
     | a    b   c   d |
     |----------------|
  1. | 1    x       2 |
  2. | 3   yz   a   4 |
     +----------------+

. hexdump example.dta

                 |                                         |    character
                 |           hex representation            | representation
         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
-----------------+-----------------------------------------+-----------------
               0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
              10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
              20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
              30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
                 |                                         |
              40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
              50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
              60 | 7220 3230 3037 2030 383a 3337 00fb 0203 | r 2007 08:37.û..
              70 | fb61 0000 0000 0000 0000 0000 0000 0000 | ûa..............
                 |                                         |
              80 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              90 | 0000 6200 0000 0000 0000 0000 0000 0000 | ..b.............
              a0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              b0 | 0000 0063 0000 0000 0000 0000 0000 0000 | ...c............
                 |                                         |
              c0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              d0 | 0000 0000 6400 0000 0000 0000 0000 0000 | ....d...........
              e0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              f0 | 0000 0000 0000 0000 0000 0000 0000 0025 | ...............%
                 |                                         |
             100 | 382e 3067 0000 0000 0000 0025 3973 0000 | 8.0g.......%9s..
             110 | 0000 0000 0000 0025 3973 0000 0000 0000 | .......%9s......
             120 | 0000 0025 382e 3067 0000 0000 0000 0000 | ...%8.0g........
             130 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
               * |                                         |
             2f0 | 0000 0000 0000 0000 0000 0000 0178 0000 | .............x..
             300 | 0065 0203 797a 6100 0004                | .e..yza...
============================================================================
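
For readers without access to Stata, a dump in a similar layout can be produced with a few lines of Python. This is a minimal sketch, not Stata's implementation: the function name `hexdump` is illustrative, every byte outside printable ASCII is shown as ".", and repeated lines are not collapsed into "*".

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Return a hex/character dump, one line per `width` bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Group the hex digits in pairs of bytes, e.g. "7102 0100".
        pairs = [chunk[i:i + 2].hex() for i in range(0, len(chunk), 2)]
        hex_part = " ".join(pairs).ljust(width * 2 + width // 2 - 1)
        # Show printable ASCII as-is and everything else as ".".
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:>8x} | {hex_part} | {text}")
    return "\n".join(lines)


# The last ten bytes of example.dta, copied from the dump above.
tail = bytes.fromhex("0065 0203 797a 6100 0004")
print(hexdump(tail))
```

To dump a whole file, pass the result of `open(path, "rb").read()` instead of `tail`.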

Let's work our way through this while looking at -help dta-

1. Header
----------

The first 109 bytes are header. 109 base 10 = 6d base 16. Here are bytes 0 through 6c from the dump:

address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
      0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
     10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
     20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
     30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
        |                                         |
     40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
     50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
     60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û..

Mark can read this. Note that the data label and the time stamp are binary-0
terminated. For example, the time stamp is:

     50 |                            31 3320 4d61 | ..F........13 Ma
     60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û..
                                          \
                                           binary 0
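
The header bytes above can be unpacked as in the following Python sketch. The field layout used here (format byte, byte-order byte, two further single bytes, 2-byte nvar, 4-byte nobs, 81-byte data label, 18-byte time stamp) is an assumption taken from a reading of -help dta- for this file format, not something stated in this message; the names `parse_header` and `c_string` are illustrative.

```python
import struct

# The 109 header bytes, copied from the dump above.
HEADER_HEX = """
7102 0100 0400 0200 0000 0000 0000 0300
0000 0000 0000 cc00 4500 0000 0000 0000
0000 0000 0000 ac4b 6600 0000 0000 d0fa
b200 0000 0000 a0b1 2c0b ff7f 0000 0000
0000 0000 0000 0300 0000 0000 0000 0500
0000 4600 0000 0800 0000 0031 3320 4d61
7220 3230 3037 2030 383a 3337 00
"""


def c_string(field: bytes) -> str:
    """Keep the bytes before the first binary 0; anything after it is junk."""
    return field.split(b"\0", 1)[0].decode("latin-1")


def parse_header(header: bytes) -> dict:
    ds_format, byteorder = header[0], header[1]
    # Assumed convention: 1 = HILO (big-endian), 2 = LOHI (little-endian).
    endian = ">" if byteorder == 1 else "<"
    nvar, nobs = struct.unpack_from(endian + "hi", header, 4)
    return {
        "ds_format": ds_format,
        "byteorder": "HILO" if byteorder == 1 else "LOHI",
        "nvar": nvar,
        "nobs": nobs,
        "data_label": c_string(header[10:91]),
        "time_stamp": c_string(header[91:109]),
    }


header = bytes.fromhex(HEADER_HEX)
info = parse_header(header)
print(info)
```

Note that the data label field holds leftover bytes after its leading binary 0, which is why the string is cut at the first 0 rather than stripped from the right.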


2. Descriptors
---------------

The descriptor has 5 components:

component      length
------------------------
typelist       nvar
varlist        nvar*33
srtlist        nvar*2 + 2
fmtlist        nvar*12
lbllist        nvar*33
------------------------

nvar = 4 in our case. The descriptor starts at byte 109, so let's fill in the
table:
                                             -- in hex --
component     length     begin     end       begin    end
-------------------------------------------------------------
typelist           4       109     112          6d     70
varlist          132       113     244          71     f4
srtlist           10       245     254          f5     fe
fmtlist           48       255     302          ff    12e
lbllist          132       303     434         12f    1b2
-------------------------------------------------------------
(by the way, I type in Stata -inbase 16 #- to convert from base 10 to base 16. E.g., -inbase 16 109-.)
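
The table can be regenerated for any nvar with a short Python sketch. The component lengths are the ones listed above; the function name `descriptor_layout` is illustrative, and the starting offset of 109 assumes the header length used in this example.

```python
def descriptor_layout(nvar: int, start: int = 109):
    """Return (component, length, begin, end) rows; begin and end are inclusive offsets."""
    lengths = [
        ("typelist", nvar),
        ("varlist", nvar * 33),
        ("srtlist", nvar * 2 + 2),
        ("fmtlist", nvar * 12),
        ("lbllist", nvar * 33),
    ]
    rows, begin = [], start
    for name, length in lengths:
        rows.append((name, length, begin, begin + length - 1))
        begin += length
    return rows


layout = descriptor_layout(4)
for name, length, begin, end in layout:
    # Decimal offsets first, then the same offsets in hex.
    print(f"{name:<10}{length:>6}{begin:>7}{end:>6}{begin:>8x}{end:>6x}")
```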

So here is the typelist:

address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
     60 |                                 fb 0203 | r 2007 08:37.û..
     70 | fb                                      | ûa..............

The types are

                type
------------------------------
var. 1    fb = 251  ->  byte
var. 2     2 =   2  ->  str2
var. 3     3 =   3  ->  str3
var. 4    fb = 251  ->  byte
------------------------------


3. Variable labels
-------------------

Each variable label is 81 bytes long. Variable labels start at byte 435:

                                  -- in hex --
          length   begin   end    begin   end
--------------------------------------------------
var. 1        81     435   515      1b3   203
var. 2        81     516   596      204   254
var. 3        81     597   677      255   2a5
var. 4        81     678   758      2a6   2f6
--------------------------------------------------
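
The starting byte 435 follows from the header and descriptor lengths: 109 + nvar*(1+33+2+12+33) + 2. A minimal Python sketch of that arithmetic, with illustrative names (`label_ranges` and the constants are not from the original):

```python
HEADER_LEN = 109
# Per-variable bytes in typelist, varlist, srtlist, fmtlist, lbllist.
PER_VAR_DESCRIPTOR = 1 + 33 + 2 + 12 + 33
LABEL_LEN = 81


def label_ranges(nvar: int) -> list[tuple[int, int]]:
    """Return inclusive (begin, end) byte offsets of each variable label."""
    # The extra 2 bytes are the final entry of srtlist (nvar*2 + 2).
    start = HEADER_LEN + nvar * PER_VAR_DESCRIPTOR + 2
    return [(start + i * LABEL_LEN, start + (i + 1) * LABEL_LEN - 1)
            for i in range(nvar)]


ranges = label_ranges(4)
for i, (begin, end) in enumerate(ranges, start=1):
    print(f"var. {i}  {begin:>4} {end:>4}   {begin:>4x} {end:>4x}")
```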


4. Expansion fields
--------------------

The expansion field starts at byte 759 (2f7 base 16). The expansion field contains

                                          -- in hex --
                 length    begin   end    begin   end
-----------------------------------------------------
datatype byte         1      759   759      2f7   2f7
len                   4      760   763      2f8   2fb
(and repeats)
-----------------------------------------------------

Our dataset contains:

address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
    2f0 |                  00 0000 0000           | .............x..

meaning datatype=0 and len=0, meaning there are no expansion fields.
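
A reader has to step over the expansion fields to find the data. The following Python sketch assumes, per -help dta-, that each field is a 1-byte datatype and a 4-byte len followed by len bytes of contents, and that the list ends with datatype=0 and len=0. The function name and the sample buffers are illustrative.

```python
import struct


def skip_expansion_fields(buf: bytes, pos: int, endian: str = "<") -> int:
    """Return the offset of the first byte after the expansion fields."""
    while True:
        data_type = buf[pos]
        (length,) = struct.unpack_from(endian + "i", buf, pos + 1)
        pos += 5  # 1 byte of datatype + 4 bytes of len
        if data_type == 0 and length == 0:
            return pos
        pos += length  # step over the contents of this field


# 759 placeholder bytes stand in for everything before the expansion field;
# the remaining bytes are copied from the dump above.
buf = bytes(759) + bytes.fromhex("00 0000 0000") \
    + bytes.fromhex("0178 0000 0065 0203 797a 6100 0004")
data_start = skip_expansion_fields(buf, 759)
print(data_start, format(data_start, "x"))
```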



5. The data (at last!)
-----------------------

The data start at byte 764 (hex 2fc). Each record is an observation, which
in our case is 1+2+3+1 = 7 bytes long (see 2. Descriptors, above).
Thus, we have

                                  -- in hex --
          length   begin   end    begin   end
--------------------------------------------------
obs 1.         7     764   770      2fc   302
obs 2.         7     771   777      303   309
--------------------------------------------------
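
The record length is the sum of the storage widths of the variables, so the offsets of every observation follow from the typelist. A minimal Python sketch, with the widths 1, 2, 3, 1 taken from the example (the function name is illustrative):

```python
def observation_ranges(widths: list[int], nobs: int, data_start: int):
    """Return inclusive (begin, end) byte offsets of each observation."""
    record_len = sum(widths)
    return [(data_start + i * record_len, data_start + (i + 1) * record_len - 1)
            for i in range(nobs)]


# byte, str2, str3, byte
obs = observation_ranges([1, 2, 3, 1], nobs=2, data_start=764)
for i, (begin, end) in enumerate(obs, start=1):
    print(f"obs {i}.  {begin} {end}   {begin:x} {end:x}")
```

The last end offset plus one should equal the file size, which is a quick check that the typelist has been interpreted correctly.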

Observation 1 is

address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
    2f0 |                               0178 0000 | .............x..
    300 | 0065 02                                 | .e..yza...

and observation 2 is

address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
--------+-----------------------------------------+-----------------
    300 |        03 797a 6100 0004                | .e..yza...

Let's break apart observation 1:

          type    hex value    meaning
--------------------------------------------------------------------
var 1.    byte    01           numeric 1
var 2.    str2    7800         string 7800   = "x"  (0 terminated)
var 3.    str3    000065       string 000065 = ""   (0 terminated)
var 4.    byte    02           numeric 2
--------------------------------------------------------------------

Note that var 3 is 000065. The binary 0 is right up front, so the string is "". The 0065 that follows is junk and is ignored.

Let's break apart observation 2:

          type    hex value    meaning
--------------------------------------------------------------------
var 1.    byte    03           numeric 3
var 2.    str2    797a         string 797a   = "yz" (not 0 terminated)
var 3.    str3    610000       string 610000 = "a"  (0 terminated)
var 4.    byte    04           numeric 4
--------------------------------------------------------------------
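
Both breakdowns can be reproduced with a short Python sketch that walks a record using the typelist. It handles only the two kinds of type that occur in this example (byte and strN) and makes no attempt to handle missing-value codes or the other numeric types; `decode_record` is an illustrative name.

```python
import struct


def decode_record(record: bytes, typelist: bytes) -> list:
    """Decode one observation made of byte (251) and strN (1..244) fields."""
    values, pos = [], 0
    for code in typelist:
        if code == 251:  # byte: one signed 8-bit integer
            (value,) = struct.unpack_from("b", record, pos)
            values.append(value)
            pos += 1
        elif 1 <= code <= 244:  # strN: N bytes, cut at the first binary 0 if any
            raw = record[pos:pos + code]
            values.append(raw.split(b"\0", 1)[0].decode("latin-1"))
            pos += code
        else:
            raise NotImplementedError(f"type code {code} is not handled in this sketch")
    return values


typelist = bytes.fromhex("fb 02 03 fb")
# The 14 data bytes at addresses 2fc through 309 in the dump above.
data = bytes.fromhex("0178 0000 0065 02 03 797a 6100 0004")
obs1 = decode_record(data[0:7], typelist)
obs2 = decode_record(data[7:14], typelist)
print(obs1)
print(obs2)
```

Cutting at the first binary 0 covers both cases: a full-length string has no 0, so all N bytes are kept, and anything after a 0 is discarded as junk.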

Note that var 2 is not zero terminated. If we were storing the string in a language that required 0 termination (say C), we would code

        memcpy(dest, bufpos, 2) ; dest[2] = '\0' ;


Conclusion
----------

I hope this helps.
Mark was worried that there was something strange about how strings appear in the .dta dataset. There is nothing strange except for the lack of 0 termination when the string is full length, and 0 termination when less than full length.

Mark needs to -hexdump- his dataset and then include debug code in his program.

-- Bill
wgould@stata.com
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/



