Statalist The Stata Listserver



Re: st: binary format type str question


From   Mark Fisher <[email protected]>
To   [email protected]
Subject   Re: st: binary format type str question
Date   Wed, 14 Mar 2007 07:49:17 -0500

Thank you for your suggestions. To some extent I have solved the problem: the files I was having difficulty with are in a different format from the one given in the documentation available at the web site. (The first byte is 110 instead of 113.) I now have no problems reading files with the current format (113). In any event, I will look at the C code for the R function that reads dta files. In fact, I was wondering where to find it. Thanks again.

--Mark

Sergiy Radyakin wrote:

Hello Mark,

if you haven't solved this problem yet, I would suggest that you use another dataset to see if the problem is file-specific or code-specific.

E.g. try a trivial case -- a dataset with only one string variable and see if your code can get it right. Alternatively try it with a publicly available dataset, so that the statalisters can also have a look at it.
You have mentioned that you read all the data as one chunk. I would suggest reading the data observation by observation, defining a record structure based on the file header.
If the size of the data area is different from what you expect, check if you handle the Hi/Lo byte order correctly when you read the header.

Another hint is this C code to read Stata files (2002) by Thomas Lumley. It is a part of the Foreign package for R. You can download it here:
http://cran.r-project.org/src/contrib/Descriptions/foreign.html
(Choose the package source in gz format even if you work in Windows. The Windows binary archive does not contain the source code.)

Below is a UUEncoded trivial file:

begin 644 test.dta
M<0(!``$`!@````!/FP`!````+/L$`3RY4P!@^+T`D/L$`9#[!`$@/%T`````
M`/S[!`%`GX,`9/L$`07IT7>=`___B`,#`(L!``````````````$````!````
MU3$T($UA<B`R,#`W(#$P.C,T``9V87(Q`'1E````````````````````````
M````````````````)3ES````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
L``````````!A``````!A8@````!A8F,```!A8F-D``!A8F-D90!A8F-D968`
`
end
sum -r/size 45278/314


Which looks in Stata as:
  obs:             6
 vars:             1                          14 Mar 2007 10:34
 size:            60 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            str6   %9s
-------------------------------------------------------------------------------
Sorted by:

And contains the following data:

. l,noo

  +--------+
  |   var1 |
  |--------|
  |      a |
  |     ab |
  |    abc |
  |   abcd |
  |  abcde |
  |--------|
  | abcdef |
  +--------+

If your code can correctly parse this file, then the dataset that you have might be written in a different format.

Regards,
Sergiy





----- Original Message -----
From: "Mark Fisher" <[email protected]>
To: <[email protected]>
Sent: Tuesday, March 13, 2007 7:12 PM
Subject: Re: st: binary format type str question



Wow, thanks so much for your help. Let me say first that I don't have access to Stata, so I can't do a -hexdump-. For reference, I'm using

http://www.stata.com/help.cgi?dta

Pretty much everything in this document (plus everything in your email) makes sense to me. But I can't make the mapping between the typelist that I get and what's in the data part of the file.

I've learned a bit more about the structure of the file in question.
I read the file (correctly, I think) right up to the point where the data start. Then, in order to do some deconstruction, I simply read *all* the remaining bytes in the file; there are only 1071 of them. Since there are 6 variables (with types 98, 136, 102, 105, 102, and 98) and 51 observations, I don't see how I can possibly account for all of them since this only allows for 21 bytes per observation.

But a clear pattern emerges if I partition the list of bytes into a matrix of 51 rows and 21 columns. The first column contains byte values running consecutively from 1 to 51 --- apparently an index encoded as the byte value itself. (How do I make a correspondence between type 98 and this variable?) The next two columns contain two characters: state abbreviations (such as AL, AK, AZ, ...). (Again, how do I make a correspondence between type 136 and this variable?) The next 7 columns (that is, columns 3 to 9) are identical row by row: {0, 1, 12, 0, 0, 0, 64}. None of the remaining columns has identical rows. (Some of the remaining columns have zeros in them.)
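
One reading that makes these numbers add up, offered as an assumption rather than something the documentation linked above states: older .dta layouts may encode numeric types as ASCII letters (98 = 'b' for byte, 105 = 'i' for int, 102 = 'f' for float) and strings as 127 plus the string length, which would make 136 a str9. A minimal Python sketch of that arithmetic, with `old_type_width` as a hypothetical helper:

```python
# Assumed (not taken from -help dta-): in the older layout, numeric types
# are ASCII letters and a string type is 127 + its length.
OLD_NUMERIC_WIDTHS = {
    ord("b"): 1,  # byte
    ord("i"): 2,  # int
    ord("l"): 4,  # long
    ord("f"): 4,  # float
    ord("d"): 8,  # double
}


def old_type_width(code):
    """Return the storage width in bytes for an old-style type code."""
    if code in OLD_NUMERIC_WIDTHS:
        return OLD_NUMERIC_WIDTHS[code]
    if code > 127:
        return code - 127  # assumed string encoding: 127 + length
    raise ValueError(f"unrecognized type code {code}")


typelist = [98, 136, 102, 105, 102, 98]
record_width = sum(old_type_width(t) for t in typelist)
print(record_width, record_width * 51)
```

Under that assumption the record is 1 + 9 + 4 + 2 + 4 + 1 = 21 bytes, and 51 records fill the 1071 bytes exactly. A str9 holding a two-letter abbreviation would also account for the seven identical bytes that follow it: a binary 0 terminator plus unused bytes.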

Anyway, that's where I stand. Is it possible this dta file was created in a nonstandard way? (All the dta files I have are from Andrew Gelman's web site for his new "Data Analysis" book. The one I can actually read says "Written by R." in the data_label.) Are there other dta files available on the web that I can experiment with?

--Mark.



William Gould, Stata wrote:

Mark Fisher <[email protected]> writes,
I'm writing a Mathematica program to read stata "dta" files. [...] I have
Everything seems to work fine [...] But I can't figure out how to properly
read the data when the data types are in the range 1 to 244 (str1, str2, ...
str244). [...]
David Kantor <[email protected]> speculated "that the string types are stored
such that... they have a 0-byte terminator if they are shorter than the
maximal length of the type; they have no terminator otherwise".

That would have been my guess as to Mark's problem, too, but Mark says No.

I want to suggest Mark become familiar with Stata's -hexdump- command.

Here's an example I just did:

============================================================================
. describe

Contains data from example.dta
  obs:             2
 vars:             4                          13 Mar 2007 08:37
 size:            22 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
a               byte   %8.0g
b               str2   %9s
c               str3   %9s
d               byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:

. list

     +----------------+
     | a    b   c   d |
     |----------------|
  1. | 1    x       2 |
  2. | 3   yz   a   4 |
     +----------------+

. hexdump example.dta

                 |                                         |    character
                 |           hex representation            |  representation
         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
-----------------+-----------------------------------------+-----------------
               0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
              10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
              20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
              30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
                 |                                         |
              40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
              50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
              60 | 7220 3230 3037 2030 383a 3337 00fb 0203 | r 2007 08:37.û..
              70 | fb61 0000 0000 0000 0000 0000 0000 0000 | ûa..............
                 |                                         |
              80 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              90 | 0000 6200 0000 0000 0000 0000 0000 0000 | ..b.............
              a0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              b0 | 0000 0063 0000 0000 0000 0000 0000 0000 | ...c............
                 |                                         |
              c0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              d0 | 0000 0000 6400 0000 0000 0000 0000 0000 | ....d...........
              e0 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
              f0 | 0000 0000 0000 0000 0000 0000 0000 0025 | ...............%
                 |                                         |
             100 | 382e 3067 0000 0000 0000 0025 3973 0000 | 8.0g.......%9s..
             110 | 0000 0000 0000 0025 3973 0000 0000 0000 | .......%9s......
             120 | 0000 0025 382e 3067 0000 0000 0000 0000 | ...%8.0g........
             130 | 0000 0000 0000 0000 0000 0000 0000 0000 | ................
               * |                                         |
             2f0 | 0000 0000 0000 0000 0000 0000 0178 0000 | .............x..
             300 | 0065 0203 797a 6100 0004                | .e..yza...
============================================================================

Let's work our way through this while looking at -help dta-

1. Header
----------

The first 109 bytes are header. 109 base 10 = 6d base 16. Here are bytes 0 through 6c from the dump:

 address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
---------+-----------------------------------------+-----------------
       0 | 7102 0100 0400 0200 0000 0000 0000 0300 | q...............
      10 | 0000 0000 0000 cc00 4500 0000 0000 0000 | ......Ì.E.......
      20 | 0000 0000 0000 ac4b 6600 0000 0000 d0fa | ......¬Kf.....Ðú
      30 | b200 0000 0000 a0b1 2c0b ff7f 0000 0000 | ²......±,.......
         |                                         |
      40 | 0000 0000 0000 0300 0000 0000 0000 0500 | ................
      50 | 0000 4600 0000 0800 0000 0031 3320 4d61 | ..F........13 Ma
      60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û..

Mark can read this. Note that the data label and the time stamp are binary-0
terminated. For example, the time stamp is:

      50 |                            31 3320 4d61 | ..F........13 Ma
      60 | 7220 3230 3037 2030 383a 3337 00        | r 2007 08:37.û..
                                          \
                                           binary 0
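
A minimal Python sketch of reading these 109 bytes. The field offsets (byte-order flag at offset 1, nvar at 4, nobs at 6, an 81-byte data label at 10, the 18-byte time stamp at 91) are assumptions taken from -help dta- and checked against the dump above; `parse_header` and `cstring` are hypothetical names.

```python
import struct


def cstring(field):
    """Cut a fixed-width field at its first binary 0, if there is one."""
    end = field.find(b"\x00")
    return field if end == -1 else field[:end]


def parse_header(buf):
    """Parse the 109-byte header of a format-113 .dta file (sketch)."""
    if len(buf) < 109:
        raise ValueError("the header is 109 bytes long")
    # Assumption: flag 2 means low byte first, flag 1 means high byte first.
    endian = "<" if buf[1] == 2 else ">"
    nvar, nobs = struct.unpack_from(endian + "Hi", buf, 4)
    return {
        "version": buf[0],
        "endian": endian,
        "nvar": nvar,
        "nobs": nobs,
        "data_label": cstring(buf[10:91]).decode("latin-1"),
        "time_stamp": cstring(buf[91:109]).decode("latin-1"),
    }


# First 10 bytes from the dump, a zero-filled label for simplicity,
# and the 0-terminated time stamp.
header = (bytes.fromhex("7102 0100 0400 0200 0000")
          + bytes(81)
          + b"13 Mar 2007 08:37\x00")
info = parse_header(header)
print(info["version"], info["nvar"], info["nobs"], info["time_stamp"])
```

Reading the byte-order flag before nvar and nobs is the point of the sketch: if the two are decoded with the wrong byte order, every offset computed from nvar afterwards will be wrong.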


2. Descriptors
---------------

The descriptor has 5 components:

    component       length
    ------------------------
    typelist        nvar
    varlist         nvar*33
    srtlist         nvar*2 + 2
    fmtlist         nvar*12
    lbllist         nvar*33
    ------------------------

nvar = 4 in our case. The descriptor starts at byte 109, so let's fill in the
table:
                                          -- in hex --
    component    length    begin   end    begin   end
    -------------------------------------------------------------
    typelist          4      109   112       6d    70
    varlist         132      113   244       71    f4
    srtlist          10      245   254       f5    fe
    fmtlist          48      255   302       ff   12e
    lbllist         132      303   434      12f   1b2
    -------------------------------------------------------------
(by the way, I type in Stata -inbase 16 #- to convert from base 10 to base 16. E.g., -inbase 16 109-.)
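
The same table can be produced for any nvar. This is a small Python sketch using only the component lengths listed above; `descriptor_layout` is a hypothetical helper, and the starting offset of 109 is the header length for this format.

```python
def descriptor_layout(nvar, start=109):
    """Return {component: (begin, end)} byte offsets, both inclusive."""
    sizes = [
        ("typelist", nvar),
        ("varlist", nvar * 33),
        ("srtlist", nvar * 2 + 2),
        ("fmtlist", nvar * 12),
        ("lbllist", nvar * 33),
    ]
    layout = {}
    pos = start
    for name, length in sizes:
        layout[name] = (pos, pos + length - 1)
        pos += length
    return layout


layout = descriptor_layout(4)
for name, (begin, end) in layout.items():
    print(f"{name:<10} {begin:>5} {end:>5} {begin:>5x} {end:>5x}")
```

For nvar = 4 the last component ends at byte 434, so whatever follows the descriptors begins at byte 435.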

So here is the typlist:

 address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
---------+-----------------------------------------+-----------------
      60 |                                 fb 0203 | r 2007 08:37.û..
      70 | fb                                      | ûa..............

The types are

                       type
    ------------------------------
    var. 1    fb = 251  ->  byte
    var. 2     2 =   2  ->  str2
    var. 3     3 =   3  ->  str3
    var. 4    fb = 251  ->  byte
    ------------------------------
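
A minimal Python sketch of the lookup. Codes 1 to 244 (strings) and 251 (byte) come from this thread; the entries for 252 to 255 and their widths are an assumption based on the format-113 documentation, and `describe_type` is a hypothetical name.

```python
# Assumed from the format-113 documentation; only 251 appears in the example.
NUMERIC_TYPES = {
    251: ("byte", 1),
    252: ("int", 2),
    253: ("long", 4),
    254: ("float", 4),
    255: ("double", 8),
}


def describe_type(code):
    """Return (type name, width in bytes) for a format-113 type code."""
    if code in NUMERIC_TYPES:
        return NUMERIC_TYPES[code]
    if 1 <= code <= 244:
        return (f"str{code}", code)  # strN occupies N bytes
    raise ValueError(f"unrecognized type code {code}")


typelist = bytes.fromhex("fb0203fb")
types = [describe_type(code) for code in typelist]
record_width = sum(width for _, width in types)
print(types, record_width)
```

A reader that meets a code outside these ranges is probably looking at a file in some other format version, which is worth reporting as an error rather than guessing.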


3. Variable labels
-------------------

Each variable label is 81 bytes long. Variable labels start at byte 435:

                                   -- in hex --
              length    begin   end    begin   end
    --------------------------------------------------
    var. 1        81      435   515      1b3   203
    var. 2        81      516   596      204   254
    var. 3        81      597   677      255   2a5
    var. 4        81      678   758      2a6   2f6
    --------------------------------------------------


4. Expansion fields
--------------------

The expansion field starts at byte 759 (2f7 base 16). The expansion field contains

                                          -- in hex --
                     length    begin   end    begin   end
    -----------------------------------------------------
    datatype byte         1      759   759      2f7   2f7
    len                   4      760   763      2f8   2fb
    (and repeats)
    -----------------------------------------------------

Our dataset contains:

 address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
---------+-----------------------------------------+-----------------
     2f0 |                  00 0000 0000           | .............x..

meaning datatype=0 and len=0, meaning there are no expansion fields.
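
A reader has to step over this section to find the data. Below is a minimal Python sketch, assuming that "(and repeats)" means each datatype/len pair is followed by len bytes of contents and that a pair of zeros ends the section; `skip_expansion_fields` is a hypothetical name.

```python
import struct


def skip_expansion_fields(buf, pos, endian="<"):
    """Return the offset of the first data byte after the expansion fields."""
    while True:
        datatype = buf[pos]
        (length,) = struct.unpack_from(endian + "i", buf, pos + 1)
        pos += 5  # 1 byte of datatype + 4 bytes of len
        if datatype == 0 and length == 0:
            return pos
        if length < 0:
            raise ValueError("negative expansion field length")
        pos += length  # assumed: contents follow the 5-byte prefix


# 759 filler bytes stand in for the header, descriptors and variable labels.
buf = bytes(759) + bytes(5) + bytes.fromhex("0178 0000 0065 0203 797a 6100 0004")
data_start = skip_expansion_fields(buf, 759)
print(data_start, hex(data_start))
```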



5. The data (at last!)
-----------------------

The data starts at byte 764 (hex 2fc). Each record is an observation, which,
in our case, is 1+2+3+1 = 7 bytes long (see 2. Descriptors, above).
Thus, we have

                                   -- in hex --
              length    begin   end    begin   end
    --------------------------------------------------
    obs 1.         7      764   770      2fc   302
    obs 2.         7      771   777      303   309
    --------------------------------------------------
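
Reading observation by observation then amounts to cutting the data area into fixed-width records and each record into fixed-width fields. A minimal Python sketch, with `split_records` as a hypothetical helper and the 14 data bytes copied from the dump:

```python
def split_records(data, widths, nobs):
    """Cut the data area into nobs records, each a list of raw field bytes."""
    reclen = sum(widths)
    if len(data) < reclen * nobs:
        raise ValueError("data area is shorter than nobs * record length")
    rows = []
    for i in range(nobs):
        record = data[i * reclen:(i + 1) * reclen]
        fields = []
        pos = 0
        for width in widths:
            fields.append(record[pos:pos + width])
            pos += width
        rows.append(fields)
    return rows


# Bytes 2fc through 309 of example.dta.
data = bytes.fromhex("0178 0000 0065 0203 797a 6100 0004")
rows = split_records(data, [1, 2, 3, 1], 2)
for fields in rows:
    print([field.hex() for field in fields])
```

The length check is a cheap way to notice that the typelist was misread: if the data area is not nobs times the record length, the widths are wrong.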

Observation 1 is

 address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
---------+-----------------------------------------+-----------------
     2f0 |                               0178 0000 | .............x..
     300 | 0065 02                                 | .e..yza...

and observation 2 is

 address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
---------+-----------------------------------------+-----------------
     300 |        03 797a 6100 0004                | .e..yza...

Let's break apart observation 1:

              type    hex value    meaning
    ------------------------------------------------------
    var 1.    byte    01           numeric 1
    var 2.    str2    7800         string 7800 = "x" (0 terminated)
    var 3.    str3    000065       string 000065 = "" (0 terminated)
    var 4.    byte    02           numeric 2
    ------------------------------------------------------

Note that var 3 is 000065. The binary 0 is right up front, so the string is "". The 0065 that follows is junk and is ignored.

Let's break apart observation 2:

              type    hex value    meaning
    --------------------------------------------------------------------
    var 1.    byte    03           numeric 3
    var 2.    str2    797a         string 797a = "yz" (not 0 terminated)
    var 3.    str3    610000       string 610000 = "a" (0 terminated)
    var 4.    byte    04           numeric 4
    --------------------------------------------------------------------

Note that var 2 is not zero terminated. If we were storing the string in a language that required 0 termination (say C), we would code

        memcpy(dest, bufpos, 2) ;
        dest[2] = '\0' ;
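
In a language without 0-terminated strings the rule runs the other way: cut the field at the first binary 0 if there is one, otherwise keep the full width. A minimal Python sketch, with `decode_str` as a hypothetical name and Latin-1 as an assumed encoding (the thread does not say which one the file uses):

```python
def decode_str(field):
    """Decode a fixed-width strN field from a .dta record."""
    end = field.find(b"\x00")
    if end != -1:
        field = field[:end]  # anything after the first 0 is junk
    return field.decode("latin-1")  # assumed encoding


for raw in ("7800", "000065", "797a", "610000"):
    print(raw, repr(decode_str(bytes.fromhex(raw))))
```

The four inputs are the string fields of the two observations above, so the results can be compared directly with the -list- output.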


Conclusion
----------

I hope this helps. Mark was worried that there was something strange about how strings appear in the .dta dataset. There is nothing strange except for the lack of 0 termination when the string is full length, and 0 termination when less than full length.

Mark needs to -hexdump- his dataset and then include debug code in his program.

-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/



