Statalist The Stata Listserver


Re: st: binary format type str question


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: binary format type str question
Date   Tue, 13 Mar 2007 09:42:55 -0500

Mark Fisher <mark@markfisher.net> writes, 

> I'm writing a Mathematica program to read stata "dta" files. [...]  I have
> Everything seems to work fine [...]  But I can't figure out how to properly
> read the data when the data types are in the range 1 to 244 (str1, str2, ...
> str244).  [...]

David Kantor <kantor.d@att.net> speculated "that the string types are stored
such that...  they have a 0-byte terminator if they are shorter than the
maximal length of the type; they have no terminator otherwise".

That would have been my guess as to Mark's problem, too, but Mark says No.

I want to suggest Mark become familiar with Stata's -hexdump- command.

Here's an example I just did:

============================================================================
. describe

Contains data from example.dta
  obs:             2                          
 vars:             4                          13 Mar 2007 08:37
 size:            22 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
a               byte   %8.0g                  
b               str2   %9s                    
c               str3   %9s                    
d               byte   %8.0g                  
-------------------------------------------------------------------------------
Sorted by:  

. list 

     +----------------+
     | a    b   c   d |
     |----------------|
  1. | 1    x       2 |
  2. | 3   yz   a   4 |
     +----------------+

. hexdump example.dta 
                 |                                         |    character
                 |           hex representation            |  representation
         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
-----------------+-----------------------------------------+-----------------
               0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q............... 
              10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E....... 
              20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú 
              30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,....... 
                 |                                         |
              40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................ 
              50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma 
              60 | 7220 3230 3037 2030 383a 3337 00fb 0203 | r 2007 08:37.û.. 
              70 | fb61 0000 0000 0000 0000 0000 0000 0000 | ûa.............. 
                 |                                         |
              80 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
              90 | 0000 6200 0000 0000 0000 0000 0000 0000 | ..b............. 
              a0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
              b0 | 0000 0063 0000 0000 0000 0000 0000 0000 | ...c............ 
                 |                                         |
              c0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
              d0 | 0000 0000 6400 0000 0000 0000 0000 0000 | ....d........... 
              e0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
              f0 | 0000 0000 0000 0000 0000 0000 0000 0025 | ...............% 
                 |                                         |
             100 | 382e 3067 0000 0000 0000 0025 3973 0000 | 8.0g.......%9s.. 
             110 | 0000 0000 0000 0025 3973 0000 0000 0000 | .......%9s...... 
             120 | 0000 0025 382e 3067 0000 0000 0000 0000 | ...%8.0g........ 
             130 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................ 
               * |                                         |
             2f0 | 0000 0000 0000 0000 0000 0000 0178 0000 | .............x.. 
             300 | 0065 0203 797a 6100 0004                | .e..yza...       

============================================================================

Let's work our way through this while looking at -help dta-.

1.  Header
----------

The first 109 bytes are header.  109 base 10 = 6d base 16.  Here are 
bytes 0 through 6c from the dump:

         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
         --------+-----------------------------------------+-----------------
               0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q............... 
              10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E....... 
              20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú 
              30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,....... 
                 |                                         |
              40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................ 
              50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma 
              60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û.. 

Mark can read this.  Note that the data label and the time stamp are binary-0
terminated.  For example, the time stamp is:

              50 |                            31 3320 4d61 | ..F........13 Ma 
              60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û.. 
                                                  \
                                                binary 0
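
As a rough illustration of reading those 109 bytes, here is a minimal Python
sketch.  The field layout (1+1+1+1+2+4+81+18 bytes) and the byte-order codes
are my reading of -help dta- for this file format, not something spelled out
above, and the names cstr() and parse_header() are made up for the example.
The latin-1 decoding is also an assumption.

```python
import struct

def cstr(buf: bytes) -> str:
    """Text up to the first binary 0; the whole buffer if there is none."""
    return buf.split(b"\x00", 1)[0].decode("latin-1")

def parse_header(buf: bytes) -> dict:
    # Byte 1 is the byte order: 1 = HILO (big-endian), 2 = LOHI (little-endian).
    endian = "<" if buf[1] == 2 else ">"
    # format byte, byte order, filetype, 1 unused byte, nvar (2 bytes), nobs (4 bytes)
    ds_format, _, filetype, nvar, nobs = struct.unpack(endian + "BBBxHi", buf[:10])
    return {
        "ds_format": ds_format,
        "endian": endian,
        "nvar": nvar,
        "nobs": nobs,
        "data_label": cstr(buf[10:91]),    # 81 bytes, binary-0 terminated
        "time_stamp": cstr(buf[91:109]),   # 18 bytes, binary-0 terminated
    }

# Header rebuilt from the dump: the 10 fixed bytes, an empty 81-byte data
# label (zeros stand in for the junk after its leading binary 0), and the
# 18-byte time stamp.
header = (bytes.fromhex("71020100040002000000")
          + b"\x00" * 81
          + b"13 Mar 2007 08:37\x00")
info = parse_header(header)
print(info)
```

Anything after the first binary 0 in a field is dropped, which is why the
junk following the empty data label in the real file does no harm.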


2.  Descriptors
---------------

The descriptor has 5 components:

	component      length
	------------------------
	typelist       nvar
	varlist        nvar*33
	srtlist        nvar*2 + 2
	fmtlist        nvar*12
	lbllist        nvar*33
	------------------------

nvar = 4 in our case.  The descriptor starts at byte 109, so let's fill in the
table:
                                                        -- in hex --
	component      length         begin    end      begin    end
	-------------------------------------------------------------
	typelist            4           109    112         6d     70
	varlist           132           113    244         71     f4
	srtlist            10           245    254         f5     fe
	fmtlist            48           255    302         ff    12e
	lbllist           132           303    434        12f    1b2
	-------------------------------------------------------------
	(by the way, I type in Stata -inbase 16 #- to convert from 
	 base 10 to base 16.  E.g., -inbase 16 109-.)
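
The same table can be computed for any nvar.  The following Python sketch
just accumulates the component lengths listed above, starting at byte 109;
descriptor_layout() is a hypothetical helper name, not part of Stata.

```python
HEADER_LEN = 109  # the header occupies bytes 0 through 6c

def descriptor_layout(nvar: int) -> dict:
    """Map each descriptor component to its (begin, end) offsets, inclusive."""
    lengths = [
        ("typelist", nvar),
        ("varlist", nvar * 33),
        ("srtlist", nvar * 2 + 2),
        ("fmtlist", nvar * 12),
        ("lbllist", nvar * 33),
    ]
    layout, pos = {}, HEADER_LEN
    for name, length in lengths:
        layout[name] = (pos, pos + length - 1)
        pos += length
    return layout

# Print begin/end in base 10 and base 16 for the 4-variable example.
for name, (begin, end) in descriptor_layout(4).items():
    print(f"{name:<9}{begin:>5}{end:>5}{begin:>6x}{end:>6x}")
```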

So here is the typelist:

         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
         --------+-----------------------------------------+-----------------
              60 |                                 fb 0203 | r 2007 08:37.û.. 
              70 | fb                                      | ûa.............. 

The types are

		  type
	------------------------------
	var. 1      fb = 251  -> byte
	var. 2       2 =   2  -> str2
	var. 3       3 =   3  -> str3
	var. 4      fb = 251  -> byte
	------------------------------
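
A small Python sketch of that decoding step follows.  Only byte (251) and
str1-str244 appear in this example; the codes for int, long, float, and
double are what I recall from -help dta- and should be checked there.
describe_type() is a hypothetical name.

```python
# Numeric type codes; only 251 (byte) is confirmed by the example above.
NUMERIC_TYPES = {
    251: ("byte", 1),
    252: ("int", 2),
    253: ("long", 4),
    254: ("float", 4),
    255: ("double", 8),
}

def describe_type(code: int) -> tuple:
    """Return (name, width in bytes) for one typelist byte."""
    if 1 <= code <= 244:
        return (f"str{code}", code)   # strN is stored in exactly N bytes
    if code in NUMERIC_TYPES:
        return NUMERIC_TYPES[code]
    raise ValueError(f"unknown type code {code}")

typelist = bytes.fromhex("fb0203fb")          # bytes 6d through 70 of the dump
types = [describe_type(c) for c in typelist]
print(types)
print(sum(width for _, width in types))       # bytes per observation
```

The sum of the widths is the record length used when reading the data.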


3.  Variable labels
-------------------

Each variable label is 81 bytes long.  Variable labels start at byte 435:

					       -- in hex --
                    length      begin   end    begin    end
	--------------------------------------------------
	var. 1      81            435   515      1b3    203
	var. 2      81            516   596      204    254
	var. 3      81            597   677      255    2a5
	var. 4      81            678   758      2a6    2f6
	---------------------------------------------------


4.  Expansion fields
--------------------

The expansion field starts at byte 759 (2f7 base 16).  The expansion field 
contains
							 -- in hex --
                                length    begin  end     begin    end 
		-----------------------------------------------------
		datatype byte        1    759    759       2f7    2f7
                len                  4    760    763       2f8    2fb
		(and repeats)
		-----------------------------------------------------

Our dataset contains:

         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
         --------+-----------------------------------------+-----------------
             2f0 |                  00 0000 0000           | .............x.. 

meaning datatype=0 and len=0, meaning there are no expansion fields.
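
A reader should not assume the expansion fields are always empty.  Here is a
Python sketch of skipping them, assuming each field is the 1-byte datatype
and 4-byte len shown above followed by len bytes of contents, and that the
list ends at the first field with datatype=0 and len=0.
skip_expansion_fields() is a made-up name.

```python
import struct

def skip_expansion_fields(buf: bytes, pos: int, endian: str = "<") -> int:
    """Given the offset of the first expansion field, return the offset
    of the first data byte."""
    while True:
        data_type, length = struct.unpack_from(endian + "Bi", buf, pos)
        pos += 5                      # 1-byte datatype + 4-byte len
        if data_type == 0 and length == 0:
            return pos                # terminator reached; data start here
        pos += length                 # skip this field's contents

# In the example file everything from byte 759 through 763 is zero.
example = bytes(764)
print(skip_expansion_fields(example, 759))
```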



5.  The data (at last!)
-----------------------

The data start at byte 764 (hex 2fc).  Each record is an observation, which,
in our case, is 1+2+3+1 = 7 bytes long (see 2. Descriptors, above).
Thus, we have 
					      -- in hex --
                    length     begin   end    begin    end
	--------------------------------------------------
	obs 1.           7       764   770      2fc    302
	obs 2.           7       771   777      303    309
	--------------------------------------------------

Observation 1 is


         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
         --------+-----------------------------------------+-----------------
             2f0 |                               0178 0000 | .............x.. 
             300 | 0065 02                                 | .e..yza...       

and observation 2 is

         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
         --------+-----------------------------------------+-----------------
             300 |        03 797a 6100 0004                | .e..yza...       

Let's break apart observation 1:

                 type    hex value    meaning
	------------------------------------------------------
	var 1.   byte    01           numeric 1
	var 2.   str2    7800         string 7800 = "x"  (0 terminated)
	var 3.   str3    000065       string 000065 = "" (0 terminated)
	var 4.   byte    02           numeric 2
	---------------------------------------------------

Note that var 3 is 000065.  The binary 0 is right up front, so the string 
is "".  The 0065 that follows is junk and is ignored.

Let's break apart observation 2:

                 type    hex value    meaning
	--------------------------------------------------------------------
	var 1.   byte    03           numeric 3
	var 2.   str2    797a         string 797a = "yz"  (not 0 terminated)
	var 3.   str3    610000       string 610000 = "a" (0 terminated)
	var 4.   byte    04           numeric 4
	--------------------------------------------------------------------

Note that var 2 is not zero terminated.  If we were storing the string 
in a language that required 0 termination (say C), we would code 

		memcpy(dest, bufpos, 2) ; dest[2] = '\0' ;
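
For comparison, here is a Python sketch that decodes both observations from
the 14 data bytes of the dump.  It takes exactly N bytes for a strN and then
cuts at the first binary 0, if there is one.  Only byte and strN are handled,
missing-value codes are not treated specially, and latin-1 decoding is an
assumption; read_str() and read_record() are hypothetical names.

```python
import struct

def read_str(field: bytes) -> str:
    """Cut a fixed-width string field at its first binary 0, if any."""
    return field.split(b"\x00", 1)[0].decode("latin-1")

def read_record(rec: bytes, typelist) -> tuple:
    values, pos = [], 0
    for code in typelist:
        if 1 <= code <= 244:          # strN occupies exactly N bytes
            values.append(read_str(rec[pos:pos + code]))
            pos += code
        elif code == 251:             # byte: one signed byte
            values.append(struct.unpack_from("b", rec, pos)[0])
            pos += 1
        else:
            raise NotImplementedError(f"type code {code} not handled here")
    return tuple(values)

typelist = [251, 2, 3, 251]           # byte, str2, str3, byte
width = 1 + 2 + 3 + 1                 # 7 bytes per observation
data = bytes.fromhex("01780000006502" "03797a61000004")  # bytes 2fc-309
rows = [read_record(data[i:i + width], typelist)
        for i in range(0, len(data), width)]
print(rows)
```

Slicing by the declared width first is what makes the full-length "yz" come
out right even though it has no terminator.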


Conclusion
----------

I hope this helps.  

Mark was worried that there was something strange about how strings appear 
in the .dta dataset.  There is nothing strange except that a string is not 
0 terminated when it is full length and is 0 terminated when it is shorter 
than full length.

Mark needs to -hexdump- his dataset and then include debug code in his 
program.

-- Bill
wgould@stata.com