Stata 15 help for dta_117

Title

[P] file formats .dta -- Description of .dta file format 117

Warning: The entry below describes the contents of an old Stata .dta file format. Newer versions of Stata continue to read, and perhaps to write, this old format. What follows is the original help file for the .dta file format when it was the current file format.

Description

Described below is the format of Stata .dta datasets. The description is highly technical and aimed at those needing to write programs in C or other languages to read and write them.

The format described here went into effect as of Stata 13. For documentation on earlier file formats, see dta_115.

Remarks

The format of .dta files has changed over time. Stata 13 writes what are known as .dta format-117 files and can read all formats of files that have ever been released. The recent history of .dta formats is

    Format    Current as of
    ---------------------------------------
     117      Stata 13
     116      internal; never released
     115      Stata 12
     114      Stata 10
     113      Stata 8
    ---------------------------------------

Format 117 is documented below.

Remarks are presented under the following headings:

    1.  Introduction
    2.  Versions and flavors of Stata
    3.  Representation of strings
    4.  Representation of numbers
    5.  Dataset format definition
        5.1  Header
             5.1.1  File format id
             5.1.2  Byteorder
             5.1.3  K, # of variables
             5.1.4  N, # of observations
             5.1.5  Dataset label
             5.1.6  Datetime stamp
        5.2  Map
        5.3  Variable types
        5.4  Variable names
        5.5  Sort order of observations
        5.6  Display formats
        5.7  Value-label names
        5.8  Variable labels
        5.9  Characteristics
        5.10 Data
        5.11 StrLs
             5.11.1  (v,o) values
             5.11.2  GSOs
             5.11.3  Advice on writing strLs
             5.11.4  Advice on reading strLs
        5.12 Value labels

1. Introduction

Stata-format datasets record data in a way generalized to work across computers that do not agree on how data are recorded. Thus the same dataset may be used, without translation, on Windows, Unix, and Mac computers. Given a computer, datasets are divided into two categories: native-format and foreign-format datasets. Stata uses the following two rules:

R1. On a given computer, Stata knows how to write native-format datasets only.

R2. Even so, Stata can read all dataset formats, whether foreign or native.

Rules R1 and R2 ensure that Stata users need not be concerned with dataset formats. If you are writing a program to read and write Stata datasets, you will have to determine whether you want to follow the same rules or instead restrict your program to operate on only native-format datasets. Because Stata follows rules R1 and R2, such a restriction would not be too limiting. If the user had a foreign-format dataset, he or she could enter Stata, use the data, and then save it again.

2. Versions and flavors of Stata

Stata is continually being updated, and these updates sometimes require changes be made to how Stata records .dta datasets. This document describes what are known as format-117 datasets, the most modern format. Stata itself can read older formats, but whenever it writes a dataset, it writes in 117 format.

There are currently three flavors of Stata available: Stata/IC, Stata/SE, and Stata/MP. The same 117 format is used by all flavors. The difference is that datasets can be larger in some flavors.

3. Representation of strings

1. Stata has two formats for strings, known to users as str# and strL. Most strings are recorded in str# format, but that is up to the user. The strL format allows for longer strings, and it allows for both binary and ASCII strings. str# strings can be ASCII only.

By the way, StataCorp internal jargon is to refer to str# strings as "strfs" (pronounced sturfs) and to strLs as "strLs" (pronounced sturls). The f in strf stands for fixed allocation length, which is how strfs are written in the file.

2. We discuss strL format strings in section 5.11.

3. Strfs may be from 1 to 2,045 bytes long.

4. Strfs are recorded with a trailing binary zero (\0) delimiter if the length of the string is less than the maximum declared length. The string is recorded without the delimiter if the string is of the maximum length.

5. Leading and trailing blanks are significant.

6. Strfs use ASCII encoding.

4. Representation of numbers

1. Numbers are represented as 1-, 2-, and 4-byte integers and 4- and 8-byte floats. In the case of floats, ANSI/IEEE Standard 754-1985 format is used, which in the case of the binary floating-point numbers considered here is equivalent to IEEE Standard 754-2008.

2. Byte ordering varies across machines for all numeric types. Bytes are ordered either least significant to most significant, dubbed LSF, or most significant to least significant, dubbed MSF. Intel-based computers, for instance, mostly use LSF encoding. Sun SPARC-based computers use MSF encoding. Itanium-based computers are interesting in that they can be either LSF or MSF depending on the operating system. Windows and Linux on Itanium use LSF encoding. HP-UX on Itanium uses MSF encoding.

3. When reading an MSF number on an LSF machine or an LSF number on an MSF machine, perform the following before interpreting the number:

    byte            no translation necessary
    2-byte int      swap bytes 0 and 1
    4-byte int      swap bytes 0 and 3, 1 and 2
    4-byte float    swap bytes 0 and 3, 1 and 2
    8-byte float    swap bytes 0 and 7, 1 and 6, 2 and 5, 3 and 4

4. For purposes of written documentation, numbers are written with the most significant byte listed first. Thus 0x0001 refers to a 2-byte integer taking on the logical value 1.

5. Stata has five numeric data types. They are

    byte      1-byte signed int
    int       2-byte signed int
    long      4-byte signed int
    float     4-byte IEEE float
    double    8-byte IEEE float

6. Each type allows for 27 missing value codes, known as ., .a, .b, ..., .z. For each type, the range allowed for nonmissing values and the missing value codes is

    byte
        minimum nonmissing    -127   (0x81)
        maximum nonmissing    +100   (0x64)
        code for .            +101   (0x65)
        code for .a           +102   (0x66)
        code for .b           +103   (0x67)
        ...
        code for .z           +127   (0x7f)

    int
        minimum nonmissing    -32767 (0x8001)
        maximum nonmissing    +32740 (0x7fe4)
        code for .            +32741 (0x7fe5)
        code for .a           +32742 (0x7fe6)
        code for .b           +32743 (0x7fe7)
        ...
        code for .z           +32767 (0x7fff)

    long
        minimum nonmissing    -2,147,483,647 (0x80000001)
        maximum nonmissing    +2,147,483,620 (0x7fffffe4)
        code for .            +2,147,483,621 (0x7fffffe5)
        code for .a           +2,147,483,622 (0x7fffffe6)
        code for .b           +2,147,483,623 (0x7fffffe7)
        ...
        code for .z           +2,147,483,647 (0x7fffffff)

    float
        minimum nonmissing    -1.701e+38  (-1.fffffeX+7e)  (sic)
        maximum nonmissing    +1.701e+38  (+1.fffffeX+7e)
        code for .                        (+1.000000X+7f)
        code for .a                       (+1.001000X+7f)
        code for .b                       (+1.002000X+7f)
        ...
        code for .z                       (+1.01a000X+7f)

    double
        minimum nonmissing    -1.798e+308 (-1.fffffffffffffX+3ff)
        maximum nonmissing    +8.988e+307 (+1.fffffffffffffX+3fe)
        code for .                        (+1.0000000000000X+3ff)
        code for .a                       (+1.0010000000000X+3ff)
        code for .b                       (+1.0020000000000X+3ff)
        ...
        code for .z                       (+1.01a0000000000X+3ff)

Note that for float, all z>1.fffffeX+7e, and for double, all z>1.fffffffffffffX+3fe are considered to be missing values, and it is merely a subset of the values that are labeled ., .a, .b, ..., .z. For example, a value between .a and .b is still considered to be missing, and in particular, all the values between .a and .b are known jointly as .a_. Nevertheless, the recording of those values should be avoided.

In the table above, we have used the {+|-}1.<digits>X{+|-}<digits> notation. The number to the left of the X is to be interpreted as a base-16 number (the period is thus the base-16 point) and the number to the right (also recorded in base 16) is to be interpreted as the power of 2 (sic). For example,

    1.01aX+3ff = (1.01a) * 2^(3ff)                          (base 16)
               = (1 + 0/16 + 1/16^2 + 10/16^3) * 2^1023     (base 10)

The {+|-}1.<digits>X{+|-}<digits> notation easily converts to IEEE 8-byte double: the 1 is the hidden bit, the digits to the right of the hexadecimal point are the mantissa bits, and the exponent is the IEEE exponent in signed (removal of offset) form. For instance, pi = 3.1415927... is

                                     8-byte IEEE, MSF
                                 -----------------------
    pi = +1.921fb54442d18X+001 = 40 09 21 fb 54 44 2d 18

                               = 18 2d 44 54 fb 21 09 40
                                 -----------------------
                                     8-byte IEEE, LSF

Converting {+|-}1.<digits>X{+|-}<digits> to IEEE 4-byte float is more difficult, but the same rule applies: the 1 is the hidden bit, the digits to the right of the hexadecimal point are the mantissa bits, and the exponent is the IEEE exponent in signed (removal of offset) form. What makes it more difficult is that the sign-and-exponent in the IEEE 4-byte format occupy 9 bits, which is not divisible by four, and so everything is shifted one bit. In float:

                               4-byte IEEE, MSF
                                 -----------
    pi = +1.921fb60000000X+001 = 40 49 0f db

                               = db 0f 49 40
                                 -----------
                               4-byte IEEE, LSF

The easiest way to obtain the above result is to first convert +1.921fb60000000X+001 to an 8-byte double and then convert the 8-byte double to a 4-byte float.

In any case, the relevant numbers are

    V    value                    MSF                LSF
    ---------------------------------------------------------------
    m    -1.fffffffffffffX+3ff    ffefffffffffffff   ffffffffffffefff
    M    +1.fffffffffffffX+3fe    7fdfffffffffffff   ffffffffffffdf7f
    .    +1.0000000000000X+3ff    7fe0000000000000   000000000000e07f
    .a   +1.0010000000000X+3ff    7fe0010000000000   000000000001e07f
    .b   +1.0020000000000X+3ff    7fe0020000000000   000000000002e07f
    .z   +1.01a0000000000X+3ff    7fe01a0000000000   00000000001ae07f

    m    -1.fffffeX+7e            feffffff           fffffffe
    M    +1.fffffeX+7e            7effffff           ffffff7e
    .    +1.000000X+7f            7f000000           0000007f
    .a   +1.001000X+7f            7f000800           0008007f
    .b   +1.002000X+7f            7f001000           0010007f
    .z   +1.01a000X+7f            7f00d000           00d0007f
    ---------------------------------------------------------------
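The double rows of the table can be turned into a small test for missing values. The following is a minimal sketch, not StataCorp code: the names stata_missing_double() and stata_double_is_missing() are invented for illustration, and the sketch assumes the host stores 64-bit integers and doubles in the same byte order (true of common platforms). NaN is not addressed by the text above and compares as nonmissing here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit pattern of "." for a double, from the table: +1.0000000000000X+3ff. */
#define STATA_DOUBLE_SYSMISS UINT64_C(0x7fe0000000000000)

/* Build the double for missing code 0 (.), 1 (.a), ..., 26 (.z).
   Each code adds 0x001 to the leading three hex digits of the mantissa,
   which is a step of 1 << 40 in the 64-bit pattern. */
double stata_missing_double(int code)
{
    uint64_t bits;
    double d;

    assert(code >= 0 && code <= 26);
    bits = STATA_DOUBLE_SYSMISS + ((uint64_t)code << 40);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Return 1 if d lies above the maximum nonmissing value
   +1.fffffffffffffX+3fe, that is, if d >= ".". */
int stata_double_is_missing(double d)
{
    return d >= stata_missing_double(0);
}
```

Because every value at or above "." is missing, a single comparison suffices; the 27 labeled codes matter only when the reader wants to report which code was stored.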

5. Dataset format definition

A Stata dataset containing two variables named myfloat and myint -- myfloat a Stata 4-byte float and myint a Stata 2-byte int -- and having one observation with myfloat = myint = 0, written to disk on 10 July 2013 at 2:23 p.m., would look like this:

---------------------------------------- top of file -----
<stata_dta>

<header>
    <release>117</release>
    <byteorder>MSF</byteorder>
    <K>0002</K>
    <N>00000001</N>
    <label>00</label>
    <timestamp>1110 Jul 2013 14:23</timestamp>
</header>

<map>
    0000000000000000
    0000000000000099
    0000000000000141
    0000000000000139
    0000000000000190
    00000000000001ab
    0000000000000220
    000000000000034e
    0000000000000371
    0000000000000384
    0000000000000393
    00000000000003b0
    00000000000003bc
</map>

<variable_types> fff7 fff9 </variable_types>

<varnames> myfloat00........................ myint00.......................... </varnames>

<sortlist> 000000000000 </sortlist>

<formats> %9.0g00............................... %8.0g00............................... </formats>

<value_label_names> 00................................ 00................................ </value_label_names>

<variable_labels> 00................................................ 00................................................ </variable_labels>

<characteristics> </characteristics>

<data> 000000000000 </data>

<strls> </strls>

<value_labels> </value_labels>

</stata_dta>
---------------------------------------- end of file -----

We have taken liberties in the spacing of the presentation. The file is actually run together, so it looks more like this,

---------------------------------------- top of file -----
<stata_dta><header><release>117</release><byteorder>MSF</by
teorder><K>0002</K><N>00000001</N><label>00</label><timesta
mp>1110 Jul 2013 14:23</timestamp></header><map>00000000000
00000000000000000009900000000000001410000000000000139000000
000000019000000000000001ab0000000000000220000000000000034e0
00000000000037100000000000003840000000000000393000000000000
03b000000000000003bc</map><variable_types>fff7fff9</variable_ty
pes><varnames>myfloat00........................ myint00....
......................</varnames><sortlist>000000000000</so
rtlist><formats>%9.0g00...............................%8.0g
00...............................</formats><value_label_nam
es>00................................00....................
.............</value_label_names><variable_labels>00.......
...........................................................
..............00...........................................
.....................................</variable_labels><cha
racteristics></characteristics><data>000000000000</data><st
rls></strls><value_labels></value_labels></stata_dta>
---------------------------------------- end of file -----

We show binary content using hexadecimal values in italics. 00, for instance, means 1-byte binary zero. The 11 following <timestamp> means one byte recording hexadecimal 11, equivalent to decimal 17, and 17 is the length of the datestamp that follows it: "10 Jul 2013 14:23". We show bytes that may contain random values -- that are and should be ignored -- using a period.

A 117-format .dta file begins with <stata_dta> and ends with </stata_dta>:

    <stata_dta>..........</stata_dta>
    /                               \
    start of file         end of file

Between those markers appear

    header               <header>..............</header>
    file map             <map>.................</map>
    variable types       <variable_types>......</variable_types>
    variable names       <varnames>............</varnames>
    sort order           <sortlist>............</sortlist>
    variable %fmts       <formats>.............</formats>
    value-label names    <value_label_names>...</value_label_names>
    variable labels      <variable_labels>.....</variable_labels>
    characteristics      <characteristics>.....</characteristics>
    data themselves      <data>................</data>
    strLs                <strls>...............</strls>
    value labels         <value_labels>........</value_labels>

Each marker pair must appear even if the content is empty, and the marker pairs must appear in the order shown.

5.1 Header

The Header is defined as

<header>...</header>

Between those markers appear

    file format id       <release>...</release>
    byteorder            <byteorder>...</byteorder>
    # of variables       <K>...</K>
    # of observations    <N>...</N>
    dataset label        <label>...</label>
    datetime stamp       <timestamp>...</timestamp>

Each marker pair must appear, and the pairs must appear in the order shown.

5.1.1 File format id

The file_format_id is recorded as

<release>117</release>

5.1.2 Byteorder

The byteorder is recorded as

<byteorder>byteorder</byteorder>

where byteorder is either MSF or LSF.

MSF stands for Most Significant byte First. In this encoding, the number 1 recorded as a 2-byte integer would be 00 followed by 01: 0001.

LSF stands for Least Significant byte First. In this encoding, the number 1 recorded as a 2-byte integer would be 01 followed by 00: 0100.
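Section 4 lists the byte swaps needed when the file's byteorder differs from the host's. The sketch below shows one way to apply them to an 8-byte double; it is an illustration under stated assumptions, not a prescribed routine. The names swap_bytes() and decode_double() are hypothetical, and the host's order is detected at run time by inspecting a 2-byte integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the n bytes at p in place.  For n = 2, 4, 8 this performs
   exactly the swaps in the section 4 table (0 and n-1, 1 and n-2, ...). */
void swap_bytes(void *p, size_t n)
{
    unsigned char *b = p;

    assert(n == 1 || n == 2 || n == 4 || n == 8);
    for (size_t i = 0; i < n / 2; i++) {
        unsigned char t = b[i];
        b[i] = b[n - 1 - i];
        b[n - 1 - i] = t;
    }
}

/* Interpret 8 raw file bytes as a double.  file_is_msf is nonzero when
   <byteorder> contained MSF and zero when it contained LSF. */
double decode_double(const unsigned char raw[8], int file_is_msf)
{
    const uint16_t probe = 1;
    /* On an MSF host the first byte of the 2-byte value 1 is 00. */
    int host_is_msf = (*(const unsigned char *)&probe == 0);
    unsigned char buf[8];
    double d;

    memcpy(buf, raw, 8);
    if (host_is_msf != file_is_msf)
        swap_bytes(buf, 8);
    memcpy(&d, buf, 8);
    return d;
}
```

The same swap_bytes() call with n = 2 or n = 4 covers the int, long, and float types.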

5.1.3 K, # of variables

K, the number of variables stored in the dataset, is recorded as

<K>bb</K>

where K = bb is a 2-byte unsigned integer field recorded according to byteorder.

5.1.4 N, # of observations

N, the number of observations stored in the dataset, is recorded as

<N>bbbb</N>

where N = bbbb is a 4-byte unsigned integer field recorded according to byteorder.
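For unsigned integer fields such as K and N, a reader can avoid caring about the host's byte order altogether by assembling the value with shifts. The following is a minimal sketch; decode_uint() is a hypothetical helper name, and the byteorder argument is assumed to be the three characters read from <byteorder>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode an n-byte unsigned integer field (1 <= n <= 8) recorded
   according to byteorder, which must be "MSF" or "LSF". */
uint64_t decode_uint(const unsigned char *p, int n, const char *byteorder)
{
    uint64_t v = 0;
    int msf = (strcmp(byteorder, "MSF") == 0);

    assert(n >= 1 && n <= 8);
    assert(msf || strcmp(byteorder, "LSF") == 0);
    for (int i = 0; i < n; i++) {
        /* Visit the bytes from most significant to least significant. */
        int k = msf ? i : n - 1 - i;
        v = (v << 8) | p[k];
    }
    return v;
}
```

With this helper, K is decode_uint(bytes, 2, byteorder) and N is decode_uint(bytes, 4, byteorder).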

5.1.5 Dataset label

The dataset label is recorded as

    <label>lccccc........c</label>
            |------------|
             l characters

First recorded is the length l of the dataset label, excluding a terminating binary 0. l is a 1-byte unsigned integer field required to be 0 <= l <= decimal 80 (hexadecimal 50). The l ASCII characters appear after that.

5.1.6 Datetime stamp

The datetime stamp records the date and time the file was written. The datetime stamp is recorded as

    <timestamp>lccccc........c</timestamp>
                |------------|
                 l characters

First recorded is the length l of the datetime stamp, excluding the terminating binary 0, if any. l is a 1-byte unsigned integer field. The l ASCII characters appear after that.

l is required to be 0 or decimal 17. If l==0, then no datetime stamp is recorded. If l==(decimal) 17, the datetime stamp is recorded in the format

    ----+----1----+--
    dd Mon yyyy hh:mm

such as

    10 Jul 2013 14:23

If dd<10 or hh<10, the element is recorded with a leading blank or a leading zero:

    04 Jul 2032 04:23
     4 Jul 2013  4:23

5.2 Map

The map has to do with the position in the file, not the Stata data themselves. The map is recorded as

<map>filepositions</map>

where filepositions is a list (vector) of 14 8-byte offsets from the start of the file, written according to byteorder. The 14 positions to be recorded are

     #   file position of the start of the
    -----------------------------------------------
     1.  <stata_dta>, definitionally 0
     2.  <map>
     3.  <variable_types>
     4.  <varnames>
     5.  <sortlist>
     6.  <formats>
     7.  <value_label_names>
     8.  <variable_labels>
     9.  <characteristics>
    10.  <data>
    11.  <strls>
    12.  <value_labels>
    13.  </stata_dta>
    14.  end-of-file
    -----------------------------------------------

Notes:

1. File positions are values that can be obtained from and set by C function lseek(). File positions are obtained by lseek(fd, 0, SEEK_CUR) just before writing the marker listed above or, in the case of end-of-file, just after writing </stata_dta>.

2. If you are writing a file, we recommend writing <map>...</map> with all file positions filled in with zero as you are proceeding sequentially and tracking the file positions in a structure such as

    struct mapdef {
            off_t   hdr ;         /* <stata_dta>, always 0 */
            off_t   map ;
            off_t   types ;
            off_t   varnames ;
            off_t   srtlist ;
            off_t   fmts ;
            off_t   vlblnames ;
            off_t   varlabs ;
            off_t   chars ;
            off_t   data ;
            off_t   strls ;
            off_t   vallabs ;
            off_t   tlr ;         /* </stata_dta> */
            off_t   bot ;         /* end-of-file */
    } ;

Record file positions in the structure just before writing the corresponding marker. Once you have written </stata_dta>, seek to map+5 (just past the 5-character <map> marker) and rewrite the file positions. Then close() the file.

3. Note that file positions are 8 bytes long, as they would be on a 64-bit computer. If you are on a 32-bit computer, you must set the most-significant 4 bytes to 0 and record your 32-bit file positions in the least-significant 4 bytes. If you are on an MSF computer, you write each file position by first writing 4 bytes of 0 and then the 4-byte file position. If you are on an LSF computer, you write each file position by writing the 4-byte file position and then writing 4 bytes of 0.
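Note 3 can be handled uniformly by widening each file position to 64 bits and emitting its bytes one at a time. The following is a minimal sketch; encode_filepos() is a hypothetical name, and msf is assumed to be nonzero when the file is being written with byteorder MSF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store pos as the 8-byte field used inside <map>...</map>.
   A 32-bit position widened to uint64_t automatically gets four zero
   bytes at the most-significant end, as note 3 requires. */
void encode_filepos(uint64_t pos, int msf, unsigned char out[8])
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        /* i counts from the least significant byte upward. */
        unsigned char b = (unsigned char)((pos >> (8 * i)) & 0xff);
        out[msf ? 7 - i : i] = b;
    }
}
```

A writer would call this once per member of struct mapdef after seeking to map+5, writing the 8 bytes each time.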

5.3 Variable types

Variable types are recorded as

<variable_types>typlist</variable_types>

where typlist is a sequence (vector) of K 2-byte unsigned integer fields written according to byteorder and recording the variable type of variable 1, 2, ..., K.

The types are encoded

                Stata
      typ      meaning    Description
    ----------------------------------------------------------
        1      str1       1 character strf
        2      str2       2 or fewer characters strf
      ...      etc.
     2045      str2045    2,045 or fewer characters strf

    32768      strL       strL of any length

    65526      double     8-byte float
    65527      float      4-byte float
    65528      long       4-byte signed integer
    65529      int        2-byte signed integer
    65530      byte       1-byte signed integer
    ----------------------------------------------------------
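A reader usually needs the type codes mainly to learn how many bytes each variable occupies in an observation. The sketch below maps the codes in the table to widths; the function names are invented for illustration, and the 8-byte width for strL relies on the (v,o) representation described in section 5.11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes that a variable of type code typ occupies in each observation
   of <data>; returns 0 for a code that is not in the table. */
int stata_type_width(unsigned typ)
{
    if (typ >= 1 && typ <= 2045)
        return (int)typ;            /* str1 ... str2045: # bytes      */
    switch (typ) {
    case 32768: return 8;           /* strL: 8-byte (v,o) value       */
    case 65526: return 8;           /* double                         */
    case 65527: return 4;           /* float                          */
    case 65528: return 4;           /* long                           */
    case 65529: return 2;           /* int                            */
    case 65530: return 1;           /* byte                           */
    default:    return 0;           /* unknown code                   */
    }
}

/* Sum of the widths of the K variables in typlist, i.e. the number of
   bytes per observation (unknown codes contribute 0). */
size_t stata_obs_width(const uint16_t *typlist, size_t K)
{
    size_t total = 0;

    assert(typlist != NULL || K == 0);
    for (size_t i = 0; i < K; i++)
        total += (size_t)stata_type_width(typlist[i]);
    return total;
}
```

Multiplying the observation width by N gives the expected length of the content between <data> and </data>, which is a cheap consistency check against the map.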

5.4 Variable names

Variable names are recorded as

<varnames>varnamelist</varnames>

where varnamelist is a sequence (vector) of K 33-character, binary-zero terminated, ASCII variable names.

varnamelist contains the names of the Stata variables 1, ..., K, each up to 32 characters in length and each terminated by a binary zero (\0). For instance, if K==4, varnamelist would be

    0        33        66          99
    |        |         |           |
    vbl1\0...myvar\0...thisvar\0...lstvar\0...

The above states that variable 1 is named vbl1, variable 2 myvar, variable 3 thisvar, and variable 4 lstvar. The byte positions indicated by periods will contain random values (and note that we have omitted some of the periods). If varnamelist is read into the C-array char varnamelist[], then &varnamelist[(i-1)*33] points to the name of the ith variable, 1 <= i <= K.

5.5 Sort order of observations

The sort order in which the observations will be subsequently recorded is recorded as

<sortlist>sortlist</sortlist>

where sortlist is a sequence (array) of K+1 unsigned 2-byte integers recorded according to byteorder.

sortlist specifies the sort-order of the dataset and is terminated by a 2-byte zero (0000 in hex). Each 2-byte element contains either a variable number or zero. The zero marks the end of the sortlist, and the recorded positions after that contain random junk. For instance, if the data are not sorted, the first 2-byte integer will contain a zero, and the 2-byte integers thereafter will contain junk. If K==4, the record will appear as

0000................

If the dataset is sorted by one variable, say myvar, and if that variable is the second variable in the varnamelist, the record will appear as

    00020000............    (if byteorder==MSF)
    02000000............    (if byteorder==LSF)

If the dataset is sorted by myvar and within myvar by vbl1, and if vbl1 is the first variable in the dataset, the record will appear as

    000200010000........    (if byteorder==MSF)
    020001000000........    (if byteorder==LSF)

If sortlist were read into the C-array short int srtlist[], then srtlist[0] would be the variable number of the first sort variable or, if the data were not sorted, 0. If the number is not 0, srtlist[1] would be the variable number of the second sort variable or, if there is not a second sort variable, 0, and so on.
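Because everything after the terminating zero is junk, a reader should stop at the first zero rather than decode all K+1 entries. The following is a minimal sketch under that reading; read_sortlist() is a hypothetical name, raw points at the bytes between <sortlist> and </sortlist>, and out must have room for K entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the sort variables from the raw sortlist bytes.
   Returns the number of sort variables (0 if the data are unsorted)
   and stores their 1-based variable numbers in out[0..n-1]. */
size_t read_sortlist(const unsigned char *raw, size_t K, int msf,
                     uint16_t *out)
{
    size_t n = 0;

    assert(raw != NULL && out != NULL);
    for (; n < K; n++) {
        const unsigned char *p = raw + 2 * n;
        uint16_t v = msf ? (uint16_t)((p[0] << 8) | p[1])
                         : (uint16_t)((p[1] << 8) | p[0]);
        if (v == 0)
            break;                  /* terminator: the rest is junk */
        out[n] = v;
    }
    return n;
}
```

The loop bound of K reflects that at most K variables can appear before the terminator, so the final entry of the K+1 is never needed.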

5.6 Display formats

The display formats associated with each variable are recorded as

<formats>fmtlist</formats>

fmtlist contains the formats of the variables 1, ..., K. Each format is 49 bytes long and includes a binary-zero end-of-string marker. For instance,

    %9.0f\0..........................................%8.2f\0......
    ....................................%20.0g\0..................
    .......................%td\0..................................
    ..........%tcDDmonCCYY_HH:MM:SS.sss\0......................

indicates that variable 1 has a %9.0f format, variable 2 a %8.2f format, variable 3 a %20.0g format, and so on. Note that these are Stata formats, not C formats.

1. Formats beginning with %t or %-t are Stata's date and time formats.

2. Stata has an old %d format notation, and some datasets still have them. Format %d... is equivalent to modern format %td... and %-d... is equivalent to %-td...

3. Nondate formats ending in gc or fc are similar to C's g and f formats, but with commas. Most translation routines would ignore the ending c (change it to \0).

4. Formats may contain commas rather than periods, such as %9,2f, indicating European format.

If fmtlist is read into the C-array char fmtlist[], then &fmtlist[49*(i-1)] refers to the starting address of the format for the ith variable.

5.7 Value-label names

The value-label names associated with each variable are recorded as

<value_label_names>lbllist</value_label_names>

where lbllist is a sequence (array) of K 33-character, binary-zero terminated strings.

lbllist contains the names of the value labels associated with the variables 1, ..., K. Each value-label name is 33 bytes long and includes a binary-zero end-of-string marker. For instance,

    0    33        66   99
    |    |         |    |
    \0...yesno\0...\0...yesno\0...

indicates that variables 1 and 3 have no value label associated with them, whereas variables 2 and 4 are both associated with the value label named yesno. If lbllist is read into the C-array char lbllist[], then &lbllist[33*(i-1)] points to the start of the label name associated with the ith variable.

5.8 Variable labels

The variable labels associated with each variable are recorded as

<variable_labels>varlbllist</variable_labels>

where varlbllist is a sequence (array) of K 81-character, binary-zero terminated strings. If a variable has no label, the first character of its label is \0.

5.9 Characteristics

Characteristics are used to record information that is unique to Stata and has no equivalent in other data management packages. When writing data, we recommend you write

<characteristics></characteristics>

That leaves the problem of reading a dataset that might contain characteristics. Characteristics are recorded as

<characteristics>...</characteristics>

We recommend you skip over the ... part. Do not, however, merely scan ahead until you encounter the close marker, because the ... part itself might contain a characteristic containing the string "</characteristics>".

The ... part contains zero or more individual characteristics, each appearing as

    <ch>llll...............</ch>
            |-------------|
              llll bytes

where llll is the length of what follows, recorded as a 4-byte unsigned integer field to be interpreted according to byteorder. Thus, to skip an individual characteristic after reading <ch>, read llll and then skip llll bytes in the file. Then verify that you next read </ch>. The marker after that will then be either </characteristics>, meaning you are done, or <ch>, meaning you have yet another individual characteristic to skip.
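The skipping procedure can be sketched on an in-memory buffer as follows. This is an illustration rather than a complete reader: skip_characteristics() is a hypothetical name, buf is assumed to start just after the <characteristics> marker, and a real program would do the same steps with reads and seeks on the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Walk the <ch>...</ch> records without looking inside them.
   Returns the offset just past </characteristics>, or -1 if the
   content is malformed or truncated. */
long skip_characteristics(const unsigned char *buf, size_t len, int msf)
{
    static const char end_mark[] = "</characteristics>";
    const size_t end_len = sizeof end_mark - 1;
    size_t pos = 0;

    assert(buf != NULL);
    for (;;) {
        if (len - pos >= end_len && memcmp(buf + pos, end_mark, end_len) == 0)
            return (long)(pos + end_len);
        if (len - pos < 8 || memcmp(buf + pos, "<ch>", 4) != 0)
            return -1;
        pos += 4;
        /* llll: 4-byte unsigned length, decoded per byteorder. */
        const unsigned char *p = buf + pos;
        uint32_t llll = msf
            ? ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
              ((uint32_t)p[2] << 8)  |  (uint32_t)p[3]
            : ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
              ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
        pos += 4;
        if (llll > len - pos)
            return -1;
        pos += llll;                /* skip the record body unread */
        if (len - pos < 5 || memcmp(buf + pos, "</ch>", 5) != 0)
            return -1;
        pos += 5;
    }
}
```

Because the body is skipped by length and never scanned, a characteristic whose contents happen to include the text </characteristics> cannot end the walk early.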

For those who want to read and write characteristics, the ... part contains the information on the individual characteristic being defined, recorded as

    0             33               66   llll-1
    |             |                |        |
    varname\0.....charname\0.......contents\0
    |---------------------------------------|
                   llll bytes

Bytes 0-32 contain a binary-zero terminated variable name, bytes 33-65 contain a binary-zero terminated characteristic name, and bytes 66 through the end of the record contain the binary-zero terminated ASCII contents of characteristic varname[charname].

5.10 Data

The data are recorded as

<data>data</data>

where data is observation 1 followed by observation 2 followed by ... followed by observation N,

<data>obs1obs2obs3...obsN</data>

and where each observation is variable 1's value followed by variable 2's value ... followed by variable K's value,

<data>v11v12...v1Kv21v22...v2K......vN1vN2...vNK</data>

Each vIJ is recorded in its own internal format, as given by typlist and defined in sections 3 (strfs) and 4 (numeric values). We have not yet explained how strLs are written; we will do that in section 5.11. In the meantime, let us imagine a dataset without strLs.

All values are written per byteorder. Strfs are binary-zero terminated if they are shorter than the allowed space, but they are not terminated if they are full width.

For instance, consider a dataset in which V1 is a float, V2 a byte, V3 a double, and V4 a str6:

. describe

    Contains data
      obs:             2
     vars:             4
     size:            38
    ----------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------
    V1              float   %9.0g
    V2              byte    %8.0g
    V3              double  %10.0g
    V4              str6    %9s
    ----------------------------------------------------------------
    Sorted by:

. list

         +-----------------------+
         | V1   V2   V3       V4 |
         |-----------------------|
      1. |  0    1    2    first |
      2. |  1    2    3   second |
         +-----------------------+

The corresponding <data>...</data> would read (assuming MSF byteorder),

    <data>00000000014000000000000000first003f800000024008000000000000
    second</data>

Values for variables and observations are run together, but we can more easily understand it if we add nonsignificant white space

    <data>
    00000000 01 4000000000000000 first00
    3f800000 02 4008000000000000 second
    </data>

1. Each variable's value is written according to its variable type.

V1's value is 4 bytes (8 hexadecimal digits) long because V1 is of type float. What is written is interpreted as a 4-byte IEEE float.

V2's value is 1 byte (2 hexadecimal digits) long because V2 is of type byte. What is written is interpreted as 1-byte signed integer.

V3's value is 8 bytes (16 hexadecimal digits) long because V3 is of type double. What is written is interpreted as 8-byte IEEE float.

2. Look carefully at V4, a str6 taking on values "first" and "second". The value "first" is written as first\0 -- with trailing binary 0. The value "second" is written without a trailing binary 0 because "second" is 6 characters long, which is to say, full length. If another observation contained "dog", it would be written dog\0.. -- a binary 0 would be written, and then two random bytes written so that the length of what was written would be the required 6.

The general rule is that str# is written in a field of # bytes. If the value is # bytes long, no binary 0 is suffixed. If the value is less than # bytes long, a binary 0 is suffixed at the end of the string.

An empty string is always written as \0 and then padded with random bytes if necessary to fill out the required length.
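The termination rule means a reader cannot simply treat a str# field as a C string: a full-width value has no terminator, and bytes after an early terminator are junk. The following is a minimal sketch of one way to read such a field; read_strf() is a hypothetical name, and dst is assumed to have room for width+1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the value stored in a str# field of width bytes into dst as a
   C string and return its length.  The value ends at the first binary
   zero or, if there is none, at the end of the field. */
size_t read_strf(const unsigned char *field, size_t width, char *dst)
{
    const unsigned char *nul;
    size_t len;

    assert(field != NULL && dst != NULL && width >= 1);
    nul = memchr(field, 0, width);
    len = nul ? (size_t)(nul - field) : width;
    memcpy(dst, field, len);
    dst[len] = '\0';                /* dst always gets a terminator */
    return len;
}
```

For the str6 example above, the field holding first\0 yields length 5 and the field holding second yields length 6.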

5.11 StrLs

StrLs are long strings. In the above section on <data>...</data>, we saw that the value of each strf -- Stata types str1, str2, ..., str2045 -- is recorded as a fixed-length string.

StrLs can be up to 2,000,000,000 characters long, so they are recorded differently.

If there are no strL variables in the data, <strls>...</strls> is recorded as

<strls></strls>

In section 5.10 we had an example showing the contents of <data>...</data> for a dataset containing four variables and two observations. There were no strLs in that example, and thus the entire <data>...</data> and <strls>...</strls> would read

    <data>00000000014000000000000000first003f800000024008000000000000
    second</data><strls></strls>

or, with more readable, nonsignificant spaces,

    <data>
    00000000 01 4000000000000000 first00
    3f800000 02 4008000000000000 second
    </data>
    <strls>
    </strls>

Let's take that example's dataset and add a strL variable to it as variable V5:

. describe

    Contains data
      obs:             2
     vars:             5
     size:            38
    ----------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------
    V1              float   %9.0g
    V2              byte    %8.0g
    V3              double  %10.0g
    V4              str6    %9s
    V5              strL    %9s
    ----------------------------------------------------------------
    Sorted by:

. list

         +--------------------------------+
         | V1   V2   V3       V4       V5 |
         |--------------------------------|
      1. |  0    1    2    first    third |
      2. |  1    2    3   second   fourth |
         +--------------------------------+

The data for the strL variable are divided between <data>...</data> and <strls>...</strls>. Run together in the .dta file, it looks like this,

    <data>00000000014000000000000000first000000000500000001
    3f800000024008000000000000second0000000500000002</data>
    <strls>GSO00000005000000018200000006third00GSO000000050
    00000028200000007fourth00</strls>

or, with more readable, nonsignificant spaces,

    <data>
    00000000 01 4000000000000000 first00 0000000500000001
    3f800000 02 4008000000000000 second  0000000500000002
    </data>
    <strls>
    GSO 0000000500000001 82 00000006third00
    GSO 0000000500000002 82 00000007fourth00
    </strls>

StrLs are recorded in two parts:

1. In <data>...</data>, each strL is recorded as an 8-byte (16 hex digit) value. These values are to be interpreted as two 4-byte fields and are known as (v,o) values.

In the first observation, the strL is recorded as (hexadecimal) 0000000500000001, which is the two 4-byte values (hexadecimal) 00000005 and 00000001, and which corresponds to (v,o) = (5,1).

In the second observation, the (v,o) value is (5,2).

2. <strls>...</strls> records the mapping of (v,o) values to corresponding strings. In the case of strLs, strings are known as Generic String Objects (GSOs).

In this example, two GSOs are defined. The first is the GSO for (v,o)=(5,1) and the second, the GSO for (5,2).

(v,o)=(5,1) corresponds to "third".

(v,o)=(5,2) corresponds to "fourth".

Obviously, there is more information recorded in the GSO than just the (v,o) value and its corresponding string, and we will get to that, but let's focus first on the (v,o) values.

5.11.1 (v,o) values

If our dataset contained variable V5 equaling "third" in both observations,

. list

         +--------------------------------+
         | V1   V2   V3       V4       V5 |
         |--------------------------------|
      1. |  0    1    2    first    third |
      2. |  1    2    3   second    third |
         +--------------------------------+

they could be recorded like this,

    <data>
    00000000 01 4000000000000000 first00 0000000500000001
    3f800000 02 4008000000000000 second  0000000500000002
    </data>
    <strls>
    GSO 0000000500000001 82 00000006third00
    GSO 0000000500000002 82 00000006third00
    </strls>

or like this:

    <data>
    00000000 01 4000000000000000 first00 0000000500000001
    3f800000 02 4008000000000000 second  0000000500000001
    </data>
    <strls>
    GSO 0000000500000001 82 00000006third00
    </strls>

Note that there is only one GSO, and both observations refer to it by specifying (v,o) = (5,1). This is called a shared or cross-linked GSO. Any number of observations can link to the same GSO. By the way, the data could not be recorded like this:

    <data>
    00000000 01 4000000000000000 first00 0000000500000002
    3f800000 02 4008000000000000 second 0000000500000002
    </data>
    <strls>
    GSO 0000000500000002 82 00000006third00
    </strls>

In this example of a mistake, (v,o) is (5,2) rather than (5,1), so observation 1 refers to a GSO belonging to observation 2, which appears later in the data. This is called a forward reference and is not allowed. You might already suspect that (v,o) values are called that because they somehow refer to variable and observation numbers. In <data>...</data>, (v,o) values for variable i, observation j are required to equal i,j or be "before" i,j, which is to say, o<j or, if o==j, v<=i.

(0,0) is a special (v,o) value that refers to a GSO containing (ascii) "" and that you do not need to define (and that you may not define). If variable V5 in the first observation contained (ascii) "",

    . list
         +--------------------------------+
         | V1   V2   V3       V4       V5 |
         |--------------------------------|
      1. |  0    1    2    first          |
      2. |  1    2    3   second   fourth |
         +--------------------------------+

the data could be recorded as

    <data>
    00000000 01 4000000000000000 first00 0000000500000001
    3f800000 02 4008000000000000 second 0000000500000002
    </data>
    <strls>
    GSO 0000000500000001 82 0000000100
    GSO 0000000500000002 82 00000007fourth00
    </strls>

but that is considered bad style because it causes Stata to waste a little memory. The right way to record the data is

    <data>
    00000000 01 4000000000000000 first00 0000000000000000
    3f800000 02 4008000000000000 second 0000000500000002
    </data>
    <strls>
    GSO 0000000500000002 82 00000007fourth00
    </strls>

In the above, note that (v,o) = (0,0) in the first observation. By the way, if both observations of variable V5 contained (ascii) "", the data would be recorded as

    <data>
    00000000 01 4000000000000000 first00 0000000000000000
    3f800000 02 4008000000000000 second 0000000000000000
    </data>
    <strls>
    </strls>

The rules for specifying (v,o) values are the following:

1. In <data>...</data>, strLs are recorded as (v,o) values. That means a (v,o) value is specified for each strL variable in each observation.

2. (v,o) values are recorded in an 8-byte field and are interpreted as 2 4-byte unsigned integer values per byteorder.

3. For variable i, observation j, (v,o) = (0,0) if i,j contains (ascii) "".

4. For variable i, observation j, if (v,o) != (0,0), then o<j or, if o==j, v<=i. That is, variable i, observation j either links to its own (v,o) = (i,j) or links to the (v,o) value of a variable and observation that appeared before it in <data>...</data>.

5. The usual case is (v,o) = (i,j).

6. Programs that write .dta files are not required to produce crosslinked (v,o) values when contents of strings are equal.

7. Programs that read .dta files are required to be able to process crosslinked (v,o) values.
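Rules 2 through 4 can be sketched in C as follows. This is only an illustration under stated assumptions: the names vo_t, get_u32, vo_decode, and vo_is_valid are hypothetical and not part of Stata, and msf is assumed to be nonzero when the file's byteorder is MSF.

```c
#include <stdint.h>

/* Hypothetical helper type: one decoded (v,o) value. */
typedef struct { uint32_t v; uint32_t o; } vo_t;

/* Read a 4-byte unsigned integer from b per byteorder
   (msf != 0 means most significant byte first). */
static uint32_t get_u32(const unsigned char *b, int msf)
{
    if (msf)
        return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
               ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* Rule 2: the 8-byte field holds two 4-byte unsigned integers, v then o. */
static vo_t vo_decode(const unsigned char *field, int msf)
{
    vo_t r;
    r.v = get_u32(field, msf);
    r.o = get_u32(field + 4, msf);
    return r;
}

/* Rules 3 and 4: for variable i, observation j (both 1-based), accept
   (0,0), (i,j) itself, or a link to a cell that appeared earlier. */
static int vo_is_valid(vo_t r, uint32_t i, uint32_t j)
{
    if (r.v == 0 && r.o == 0) return 1;
    return r.o < j || (r.o == j && r.v <= i);
}
```

A reader would call vo_decode() on each strL field as it reads <data>...</data> and could reject the file when vo_is_valid() returns 0.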

5.11.2 GSOs

The markers <strls>...</strls> contain the definitions of zero or more GSOs:

<strls>GSOdefGSOdef...GSOdef</strls>

Each GSO can contain either an ascii or a binary string. We use the following definitions:

A string is ascii-like if its significant characters contain no binary zeros and a binary zero follows its last significant character.

A string is ascii if it is ascii-like and it uses the ASCII character encoding.

A string that is not ascii is binary.

The format of a GSO record is

           o       len
            \     /    contents
           ---- ----  /
    GSOvvvvooootllllxxxxxxxxxxxxxxx...x
       ----    -    (---- len bytes ----)
      /        |
     v         type

    name       length   contents
    -----------------------------------------------------------
                    3   GSO (fixed string)
    v               4   unsigned 4-byte integer, v of (v,o)
    o               4   unsigned 4-byte integer, o of (v,o)
    t               1   unsigned 1-byte integer
    len             4   unsigned 4-byte integer
    contents      len   contents of strL
    -----------------------------------------------------------
    v, o, and len are recorded per byteorder.

    t is encoded:
        129 (decimal)   binary
        130 (decimal)   ascii

    if t==129, contents contains the string AS-IS.
               len contains the length of contents.

    if t==130, contents must contain trailing \0.
               len contains the length of the string including \0.
               If using C, len = strlen(string) + 1.

Notes:

1. v and o are the (v,o) values defined in <data>...</data>. v and o must follow the rules of specification previously given.

2. Variable v must be of type strL.

3. GSOs must appear in "ascending" order of (v,o). Ascending order is defined as the same order as they appeared in <data>...</data>: ascending v for o==1, followed by ascending v for o==2, ....

4. All (v,o) values that appeared in <data>...</data> must be defined except (v,o) = (0,0). Each may be defined only once.

5. (v,o) = (0,0) may not be defined.
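The record layout above can be decoded along the following lines. This is a minimal sketch, not Stata's own code: gso_t, gso_u32, and gso_parse are hypothetical names, the record is assumed to be already in memory, and msf is assumed to be nonzero for MSF byteorder.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of one GSO; contents points into the buffer. */
typedef struct {
    uint32_t v, o;
    unsigned char t;                 /* 129 = binary, 130 = ascii */
    uint32_t len;
    const unsigned char *contents;
} gso_t;

/* Read a 4-byte unsigned integer per byteorder (msf != 0: MSF). */
static uint32_t gso_u32(const unsigned char *b, int msf)
{
    if (msf)
        return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
               ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* Parse one GSO starting at buf, with avail bytes available.
   Returns the number of bytes consumed, or 0 if the record is malformed. */
static size_t gso_parse(const unsigned char *buf, size_t avail, int msf,
                        gso_t *out)
{
    /* fixed part: "GSO" (3) + v (4) + o (4) + t (1) + len (4) = 16 bytes */
    if (avail < 16 || memcmp(buf, "GSO", 3) != 0) return 0;
    out->v   = gso_u32(buf + 3, msf);
    out->o   = gso_u32(buf + 7, msf);
    out->t   = buf[11];
    out->len = gso_u32(buf + 12, msf);
    if (out->t != 129 && out->t != 130) return 0;
    if (out->len > avail - 16) return 0;
    out->contents = buf + 16;
    /* an ascii GSO must end in \0, and len counts that byte */
    if (out->t == 130 &&
        (out->len == 0 || out->contents[out->len - 1] != 0)) return 0;
    return 16 + (size_t)out->len;
}
```

Calling gso_parse() repeatedly, advancing by the returned count each time, walks the GSOs between <strls> and </strls>.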

5.11.3 Advice on writing strLs

Writing .dta datasets containing strLs is easy if you do not attempt to link equal strLs. Sometimes, crosslinking is easy, too, depending on how your original data are stored.

Here is pseudocode for writing strLs without crosslinking:

    write "<data>"
    for (j=1; j<=N; j++) {
        for (i=1; i<=K; i++) {
            if (variable i is strL) {
                if (contents of i != (ascii) "") {
                    write i as 4 bytes
                    write j as 4 bytes
                }
                else {
                    write 0 as 8 bytes
                }
            }
            else ... /* the usual */
        }
    }
    write "</data>"

    write "<strls>"
    for (j=1; j<=N; j++) {
        for (i=1; i<=K; i++) {
            if (variable i is strL) {
                if (contents of i != (ascii) "") {
                    write GSO for (v,o) = (i,j)
                }
            }
        }
    }
    write "</strls>"
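The step "write GSO for (v,o) = (i,j)" might look like the following in C for an ascii string. This is a sketch under assumptions: put_u32 and gso_put_ascii are hypothetical names, the GSO is built in a caller-supplied buffer rather than written directly to the file, and msf is assumed to be nonzero for MSF byteorder.

```c
#include <stdint.h>
#include <string.h>

/* Store x at b as a 4-byte unsigned integer per byteorder
   (msf != 0 means most significant byte first). */
static void put_u32(unsigned char *b, uint32_t x, int msf)
{
    int k;
    for (k = 0; k < 4; k++) {
        unsigned char byte = (unsigned char)(x >> (8 * (3 - k)));
        b[msf ? k : 3 - k] = byte;
    }
}

/* Build the GSO for (v,o) = (i,j) holding C string s as type 130 (ascii).
   Returns the number of bytes stored in dst, or 0 if dst is too small. */
static size_t gso_put_ascii(unsigned char *dst, size_t cap,
                            uint32_t i, uint32_t j, const char *s, int msf)
{
    size_t len = strlen(s) + 1;          /* len includes the trailing \0 */

    if (len > UINT32_MAX || cap < 16 + len) return 0;
    memcpy(dst, "GSO", 3);
    put_u32(dst + 3, i, msf);            /* v */
    put_u32(dst + 7, j, msf);            /* o */
    dst[11] = 130;                       /* t: ascii */
    put_u32(dst + 12, (uint32_t)len, msf);
    memcpy(dst + 16, s, len);            /* contents, with \0 */
    return 16 + len;
}
```

As in the pseudocode, the caller is expected to skip this step for empty strings and to record (v,o) = (0,0) in <data>...</data> instead.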

5.11.4 Advice on reading strLs

Here is pseudocode for reading strLs (including crosslinking):

    read "<data>"
    for (j=1; j<=N; j++) {
        for (i=1; i<=K; i++) {
            read data the usual way
            in the case of strLs, just store (v,o) values
        }
    }
    read "</data>"

    read "<strls>"
    for (j=1; j<=N; j++) {
        for (i=1; i<=K; i++) {
            if (variable i is strL) {
                get v and o from data(i,j)
                if (v==i && o==j) {
                    read GSO up to contents
                    read len bytes of contents
                    store contents in new_dataset(i,j)
                }
                else {
                    if (v==0 && o==0) {
                        store (ascii) "" in new_dataset(i,j)
                    }
                    else if (o<j || (o==j && v<i)) {
                        retrieve string new_dataset(v,o) ...
                            ... that you previously stored.
                        store string in new_dataset(i,j)
                    }
                    else {
                        abort with error due to ...
                            ... forward reference
                    }
                }
            }
        }
    }
    read "</strls>"

5.12 Value labels

Numeric variables in Stata optionally have value labels associated with them. Value labels map numeric values to strings, such as 1 to "male" and 2 to "female". Mappings are named. The mapping of 1 to "male" and 2 to "female" might be named gender. The recording of the names of the mappings optionally associated with variables was discussed in section 5.7. Variable sex might be associated with value label gender.

Here we discuss the recording of the value label definition itself, such as gender. Even if value label gender is used by a variable, it is not required that the corresponding value-label definition be provided.

Value labels are defined by

<value_labels>individual_definitions</value_labels>

where each individual definition is given by

<lbl>def</lbl>

If no individual definitions are provided, the above becomes

<value_labels></value_labels>

If individual definitions are provided, the above becomes

<value_labels><lbl>def</lbl>...<lbl>def</lbl></value_labels>

where def is

    len labelname                  padding  value_label_table
    |   |                                |  |
    llllcccccccccccccccccccccccccccccccccppp...................
        |-------- 33 characters --------|   |--- len bytes ---|

    def                   len   format   comment
    -------------------------------------------------------------------
    len                     4   int      length of value_label_table
    labname                33   char     \0 terminated
    padding                 3
    value_label_table     len            see next table
    -------------------------------------------------------------------

    value_label_table     len   format      comment
    ----------------------------------------------------------
    n                       4   int         number of entries
    txtlen                  4   int         length of txt[]
    off[]                 4*n   int array   txt[] offset table
    val[]                 4*n   int array   sorted value table
    txt[]              txtlen   char        text table
    ----------------------------------------------------------

len, n, txtlen, off[], and val[] are encoded per byteorder. The maximum length of a single label within txt[] is 32,000 characters, or 32,001 including the terminating binary 0. Stata ignores labels that exceed the limit.

For example, the value_label_table for 1<->yes and 2<->no, shown in MSF format, would be

    byte position:  00 01 02 03  04 05 06 07  08 09 10 11  12 13 14 15
    ---------------------------------------------------------------------
    contents:       00 00 00 02  00 00 00 07  00 00 00 00  00 00 00 04
    meaning:        n = 2        txtlen = 7   off[0] = 0   off[1] = 4
    ---------------------------------------------------------------------

    byte position:  16 17 18 19  20 21 22 23  24 25 26 27  28 29 30
    ---------------------------------------------------------------------
    contents:       00 00 00 01  00 00 00 02   y  e  s 00   n  o 00
    meaning:        val[0] = 1   val[1] = 2   txt --->
    ---------------------------------------------------------------------

The interpretation is that there are n=2 values being mapped. The values being mapped are val[0]=1 and val[1]=2. The corresponding text for val[0] would be at off[0]=0 of txt[] (and so be "yes") and for val[1] would be at off[1]=4 of txt[] (and so be "no").
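The lookup just described can be sketched in C without relying on the host's byteorder or int size by decoding each field byte by byte. The names vl_u32 and vl_entry are hypothetical, msf is assumed to be nonzero for MSF byteorder, and the whole value_label_table is assumed to be in memory already.

```c
#include <stdint.h>
#include <string.h>

/* Read a 4-byte unsigned integer per byteorder (msf != 0: MSF). */
static uint32_t vl_u32(const unsigned char *b, int msf)
{
    if (msf)
        return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
               ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* Fetch entry k (0-based) of the value_label_table at tbl.
   Stores the mapped value in *val and returns its text, or returns
   NULL if k is out of range or the offset lies outside txt[]. */
static const char *vl_entry(const unsigned char *tbl, int msf,
                            uint32_t k, int32_t *val)
{
    uint32_t n      = vl_u32(tbl, msf);
    uint32_t txtlen = vl_u32(tbl + 4, msf);
    const unsigned char *off  = tbl + 8;               /* off[] */
    const unsigned char *vals = off + 4 * (size_t)n;   /* val[] */
    const unsigned char *txt  = vals + 4 * (size_t)n;  /* txt[] */
    uint32_t o;

    if (k >= n) return NULL;
    o = vl_u32(off + 4 * (size_t)k, msf);
    if (o >= txtlen) return NULL;
    /* values are signed; assumes the usual two's-complement conversion */
    *val = (int32_t)vl_u32(vals + 4 * (size_t)k, msf);
    return (const char *)(txt + o);
}
```

Applied to the 31 bytes shown above, entry 0 yields the value 1 with text "yes", and entry 1 yields the value 2 with text "no".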

Interpreting this table in C is not as daunting as it appears. Let (char *) p refer to the memory area into which value_label_table is read. Assume your compiler uses 4-byte ints. The following manifests make interpreting the table easier:

    #define SZInt               4
    #define Off_n               0
    #define Off_nxtoff          SZInt
    #define Off_off             (SZInt+SZInt)
    #define Off_val(n)          (SZInt+SZInt+(n)*SZInt)
    #define Off_txt(n)          (Off_val(n) + (n)*SZInt)
    #define Len_table(n,nxtoff) (Off_txt(n) + (nxtoff))

    #define Ptr_n(p)        ( (int *)  ( ((char *) (p)) + Off_n ) )
    #define Ptr_nxtoff(p)   ( (int *)  ( ((char *) (p)) + Off_nxtoff ) )
    #define Ptr_off(p)      ( (int *)  ( ((char *) (p)) + Off_off ) )
    #define Ptr_val(p,n)    ( (int *)  ( ((char *) (p)) + Off_val(n) ) )
    #define Ptr_txt(p,n)    ( (char *) ( ((char *) (p)) + Off_txt(n) ) )

It is now the case that, with n = *Ptr_n(p), for (i=0; i<n; i++), the value Ptr_val(p,n)[i] is mapped to the character string Ptr_txt(p,n) + Ptr_off(p)[i]. Note that the second argument of Ptr_val() and Ptr_txt() is the number of entries n, not the index of an entry.

Remember in allocating memory for *p that the table can be big. The limits are n=65,536 mapped values, with the label for each value being up to 32,001 characters long (including the terminating null byte). There are n offsets and n numeric values in the table, each 4 bytes long. n itself is 4 bytes, and txtlen is 4 bytes. Such a table would be 2,097,741,832 bytes long ((65536 * (32001 + 4 + 4)) + 4 + 4). No user is likely to approach that limit, and in any case, after reading the first 8 bytes of the table (n and txtlen), you can calculate the remaining length as 2*4*n+txtlen and malloc() the exact amount.
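The arithmetic above can be checked with a small helper. This is a sketch only: vl_remaining is a hypothetical name, and a 64-bit result is used so that the computation cannot overflow on systems with 4-byte ints.

```c
#include <stdint.h>

/* Bytes remaining in value_label_table after n and txtlen have been
   read: n offsets plus n values (4 bytes each) plus txtlen bytes of text. */
static uint64_t vl_remaining(uint32_t n, uint32_t txtlen)
{
    return 2u * 4u * (uint64_t)n + (uint64_t)txtlen;
}
```

For the two-entry example (n=2, txtlen=7), this gives 23 remaining bytes, or 31 bytes in total once the 8 bytes for n and txtlen are counted, which matches byte positions 00 through 30 shown earlier. At the limits, n=65536 and txtlen=65536*32001 give 8 + 2,097,741,824 = 2,097,741,832 bytes.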

Constructing the table is more difficult. The easiest approach is to set arbitrary limits equal to or smaller than Stata's as to the maximum number of entries and total text length you will allow and simply declare the three pieces off[], val[], and txt[] according to those limits:

    int  off[MaxValueForN] ;
    int  val[MaxValueForN] ;
    char txt[MaxValueForTxtlen] ;

Stata's internal code follows a more complicated strategy of always keeping the table in compressed form and having a routine that will "add one position" in the table. This is slower but keeps memory requirements no greater than the actual size of the table.

In any case, when adding new entries to the table, remember that val[] must be in ascending order: val[0] < val[1] < ... < val[n-1].

It is not required that off[] or txt[] be kept in ascending order. We previously offered the example of the table that mapped 1<->yes and 2<->no:

    byte position:  00 01 02 03  04 05 06 07  08 09 10 11  12 13 14 15
    ---------------------------------------------------------------------
    contents:       00 00 00 02  00 00 00 07  00 00 00 00  00 00 00 04
    meaning:        n = 2        txtlen = 7   off[0] = 0   off[1] = 4
    ---------------------------------------------------------------------

    byte position:  16 17 18 19  20 21 22 23  24 25 26 27  28 29 30
    ---------------------------------------------------------------------
    contents:       00 00 00 01  00 00 00 02   y  e  s 00   n  o 00
    meaning:        val[0] = 1   val[1] = 2   txt --->
    ---------------------------------------------------------------------

This table could just as well be recorded as

    byte position:  00 01 02 03  04 05 06 07  08 09 10 11  12 13 14 15
    ---------------------------------------------------------------------
    contents:       00 00 00 02  00 00 00 07  00 00 00 03  00 00 00 00
    meaning:        n = 2        txtlen = 7   off[0] = 3   off[1] = 0
    ---------------------------------------------------------------------

    byte position:  16 17 18 19  20 21 22 23  24 25 26 27  28 29 30
    ---------------------------------------------------------------------
    contents:       00 00 00 01  00 00 00 02   n  o 00  y   e  s 00
    meaning:        val[0] = 1   val[1] = 2   txt --->
    ---------------------------------------------------------------------

but it could not be recorded as

    byte position:  00 01 02 03  04 05 06 07  08 09 10 11  12 13 14 15
    ---------------------------------------------------------------------
    contents:       00 00 00 02  00 00 00 07  00 00 00 04  00 00 00 00
    meaning:        n = 2        txtlen = 7   off[0] = 4   off[1] = 0
    ---------------------------------------------------------------------

    byte position:  16 17 18 19  20 21 22 23  24 25 26 27  28 29 30
    ---------------------------------------------------------------------
    contents:       00 00 00 02  00 00 00 01   y  e  s 00   n  o 00
    meaning:        val[0] = 2   val[1] = 1   txt --->
    ---------------------------------------------------------------------

It is not the out-of-order values of off[] that cause problems; it is out-of-order values of val[]. In terms of table construction, we find it easier to keep the table sorted as it grows. This way one can use a binary search routine to find the appropriate position in val[] quickly.

The following routine will find the appropriate slot. It uses the manifests we previously defined, and thus it assumes the table is in compressed form, but that is not important. Changing the definitions of the manifests to point to separate areas would be easy enough.

    /*
        slot = vlfindval(char *baseptr, int val)

        Looks for value val in label at baseptr.
            If found:
                returns slot number:  0, 1, 2, ...
            If not found:
                returns k<0 such that val would go in slot -(k+1)
                    k== -1   would go in slot 0.
                    k== -2   would go in slot 1.
                    k== -3   would go in slot 2.
    */

    int vlfindval(char *baseptr, int myval)
    {
        int     n ;
        int     lb, ub, try ;
        int     *val ;
        int     curval ;

        n   = *Ptr_n(baseptr) ;
        val = Ptr_val(baseptr, n) ;

        if (n==0) return(-1) ;      /* not found, insert into 0 */

                                    /* in what follows,               */
                                    /* we know result between [lb,ub] */
                                    /* or it is not in the table      */
        lb = 0 ;
        ub = n - 1 ;
        while (1) {
            try = (lb + ub) / 2 ;
            curval = val[try] ;
            if (myval == curval) return(try) ;
            if (myval<curval) {
                ub = try - 1 ;
                if (ub<lb) return(-(try+1)) ;
                /* because want to insert before try, ergo,
                   want to return try, and transform is -(W+1). */
            }
            else /* myval>curval */ {
                lb = try + 1 ;
                if (ub<lb) return(-(lb+1)) ;
                /* because want to insert after try, ergo,
                   want to return try+1 and transform is -(W+1) */
            }
        }
        /*NOTREACHED*/
    }


© Copyright 1996–2018 StataCorp LLC