Stata 15 help for dta

[P] file formats .dta -- Description of .dta file format

Description

Described below is the format of Stata .dta datasets. The description is highly technical and aimed at programmers who need to write code in C or other languages to read and write Stata .dta files.

The format described here went into effect as of Stata 14 and is known as .dta format 118. It is also the primary format used by Stata 15. For datasets with more than 32,767 variables, Stata 15 uses a slightly different format, 119. For documentation on format 119 see dta_119. For documentation on earlier file formats, see dta_117.

We will highlight in red changes between .dta formats 117 and 118.

Remarks

The format of .dta files has changed over time. Stata 15 writes what are known as .dta format-118 files and can read all formats of files that have ever been released. The recent history of .dta formats is

    Format    Current as of
    ---------------------------------------
    119       Stata 15 (when dataset has more than 32,767 variables)
    118       Stata 14 and Stata 15
    117       Stata 13
    116       internal; never released
    115       Stata 12
    114       Stata 10
    113       Stata 8
    ---------------------------------------

Format 118 is documented below.

Remarks are presented under the following headings:

    1.  Introduction
    2.  Versions and flavors of Stata
    3.  Representation of strings
    4.  Representation of numbers
    5.  Dataset format definition
        5.1   Header
              5.1.1  File format id
              5.1.2  Byteorder
              5.1.3  K, # of variables
              5.1.4  N, # of observations
              5.1.5  Dataset label
              5.1.6  Datetime stamp
        5.2   Map
        5.3   Variable types
        5.4   Variable names
        5.5   Sort order of observations
        5.6   Display formats
        5.7   Value-label names
        5.8   Variable labels
        5.9   Characteristics
        5.10  Data
        5.11  StrLs
              5.11.1  (v,o) values
              5.11.2  GSOs
              5.11.3  Advice on writing strLs
              5.11.4  Advice on writing 6-byte integers
              5.11.5  Advice on reading strLs
              5.11.6  Advice on reading 6-byte integers
        5.12  Value labels

1. Introduction

Stata-format datasets record data in a way generalized to work across computers that do not agree on how data are recorded. Thus the same dataset may be used, without translation, on Windows, Unix, and Mac computers. Given a computer, datasets are divided into two categories: native-format and foreign-format datasets. Stata uses the following two rules:

R1. On a given computer, Stata knows how to write native-format datasets only.

R2. Even so, Stata can read all dataset formats, whether foreign or native.

Rules R1 and R2 ensure that Stata users need not be concerned with dataset formats. If you are writing a program to read and write Stata datasets, you will have to determine whether you want to follow the same rules or instead restrict your program to operate on only native-format datasets. Because Stata follows rules R1 and R2, such a restriction would not be too limiting. If the user had a foreign-format dataset, he or she could enter Stata, use the data, and then save it again.

2. Versions and flavors of Stata

Stata is continually being updated, and these updates sometimes require changes to how Stata records .dta datasets. This document describes what are known as format-118 datasets, the current format. Stata itself can read older formats, but whenever it writes a dataset, it writes in 118 format, unless the dataset has more than 32,767 variables, in which case it writes in 119 format.

There are currently three flavors of Stata available: Stata/IC, Stata/SE, and Stata/MP. The same 118 format is used by all flavors. The difference is that datasets can be larger in some flavors.

3. Representation of strings

Strings are encoded UTF-8 in Stata. We are referring to all strings, whether data, variable names, display formats, etc.

Each UTF-8 character consumes 1 to 4 bytes of storage. Thus the byte length and character length of UTF-8 strings differ. A string containing 5 characters can have a byte length of anywhere from 5 to 20.

Stata generally places a binary-0 (\0) at the end of strings. There are a few exceptions, so read .dta-format specifications carefully where strings are involved.

The recording of variable names is an example of the trailing \0. Stata allows variable names of up to 32 characters in length. That means 32*4+1 = 129 bytes must be allocated for storing variable names.

ASCII is a proper subset of UTF-8. UTF-8 bytes in the range 0x01 to 0x7e inclusive have the usual ASCII interpretation.
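The difference between byte length and character length can be illustrated with a minimal sketch. The function name utf8_char_count is ours, not part of any Stata interface, and the sketch assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Count the UTF-8 characters in a \0-terminated string by counting the
   bytes that are not continuation bytes (continuation bytes have the bit
   pattern 10xxxxxx).  Assumes valid UTF-8; no validation is attempted. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

A check such as "at most 32 characters" would use this count, whereas buffer sizes such as 129 refer to bytes.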

Let's now turn to strings stored in data (variables and observations).

1. Strings stored in data are UTF-8 encoded.

2. Stata has two storage formats for strings, known to users as str# and strL. # records the number of bytes required to store the string. Most strings are recorded in str# format, but that is up to the user. The strL storage format allows for longer strings, and it allows for binary (generic) strings.

By the way, StataCorp's internal jargon is to refer to str# strings as "strfs" (pronounced sturfs) and to strLs as "strLs" (pronounced sturls). The f in strf stands for fixed allocation length, which is how strfs are written in the file.

3. We discuss strL format strings in section 5.11.

4. Strfs are recorded with a trailing binary-zero (\0) delimiter if the byte length of the string is less than the maximum declared length. The string is recorded without the delimiter if the string is of the maximum length. Thus the observations of a str40 variable can contain strings of 0 to 40 bytes in length.

Just to be clear, we will consider a str4 variable. In the first observation, the value of the variable might be "Mary". "Mary" would be stored in the 4-byte field without a trailing \0. In the second observation, the value might be "Bob". "Bob" would be stored as "Bob\0".

5. Leading and trailing blanks are significant.
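The strf termination rule in item 4 can be sketched as follows. The name strf_length is hypothetical; the sketch only assumes that the caller knows the declared width of the str# variable.

```c
#include <stddef.h>
#include <string.h>

/* Return the byte length of a strf stored in a fixed field of `width`
   bytes.  The value ends at the first \0; if there is no \0, the value
   fills the whole field. */
size_t strf_length(const unsigned char *field, size_t width)
{
    const unsigned char *nul = (const unsigned char *)memchr(field, 0, width);
    return nul ? (size_t)(nul - field) : width;
}
```

For the str4 example, the field holding "Mary" yields 4 and the field holding "Bob\0" yields 3. Leading and trailing blanks are left alone because they are significant.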

4. Representation of numbers

1. Numbers are represented as 1-, 2-, and 4-byte integers and 4- and 8-byte floats. In the case of floats, ANSI/IEEE Standard 754-1985 format is used, which in the case of the binary floating-point numbers considered here is equivalent to IEEE Standard 754-2008.

2. Byte ordering varies across machines for all numeric types. Bytes are ordered either least significant to most significant, dubbed LSF, or most significant to least significant, dubbed MSF. Intel-based computers, for instance, mostly use LSF encoding. Sun SPARC-based computers use MSF encoding. Itanium-based computers are interesting in that they can be either LSF or MSF depending on the operating system. Windows and Linux on Itanium use LSF encoding. HP-UX on Itanium uses MSF encoding.

3. When reading an MSF number on an LSF machine or an LSF number on an MSF machine, perform the following before interpreting the number:

    byte            no translation necessary
    2-byte int      swap bytes 0 and 1
    4-byte int      swap bytes 0 and 3, 1 and 2
    4-byte float    swap bytes 0 and 3, 1 and 2
    8-byte float    swap bytes 0 and 7, 1 and 6, 2 and 5, 3 and 4
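All the swaps listed above amount to reversing the bytes of the value, so one routine can serve every numeric type. The following is a sketch only; swap_bytes is a name we made up for illustration.

```c
#include <stddef.h>

/* Reverse the n bytes at p in place: byte 0 <-> n-1, 1 <-> n-2, ...
   With n = 1 nothing happens, matching "no translation necessary". */
void swap_bytes(void *p, size_t n)
{
    unsigned char *b = (unsigned char *)p;
    for (size_t i = 0; i < n / 2; i++) {
        unsigned char t = b[i];
        b[i] = b[n - 1 - i];
        b[n - 1 - i] = t;
    }
}
```

The routine would be applied to the raw bytes read from the file before they are interpreted as a number, and only when the file's byteorder differs from the machine's.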

4. For purposes of written documentation, numbers are written with the most significant byte listed first. Thus 0x0001 refers to a 2-byte integer taking on the logical value 1.

5. Stata has five numeric data types. They are

    byte      1-byte signed int
    int       2-byte signed int
    long      4-byte signed int
    float     4-byte IEEE float
    double    8-byte IEEE float

6. Each type allows for 27 missing value codes, known as ., .a, .b, ..., .z. For each type, the range allowed for nonmissing values and the missing value codes is

    byte
        minimum nonmissing    -127   (0x81)
        maximum nonmissing    +100   (0x64)
        code for .            +101   (0x65)
        code for .a           +102   (0x66)
        code for .b           +103   (0x67)
        ...
        code for .z           +127   (0x7f)

    int
        minimum nonmissing    -32767   (0x8001)
        maximum nonmissing    +32740   (0x7fe4)
        code for .            +32741   (0x7fe5)
        code for .a           +32742   (0x7fe6)
        code for .b           +32743   (0x7fe7)
        ...
        code for .z           +32767   (0x7fff)

    long
        minimum nonmissing    -2,147,483,647   (0x80000001)
        maximum nonmissing    +2,147,483,620   (0x7fffffe4)
        code for .            +2,147,483,621   (0x7fffffe5)
        code for .a           +2,147,483,622   (0x7fffffe6)
        code for .b           +2,147,483,623   (0x7fffffe7)
        ...
        code for .z           +2,147,483,647   (0x7fffffff)

    float
        minimum nonmissing    -1.701e+38   (-1.fffffeX+7e)   (sic)
        maximum nonmissing    +1.701e+38   (+1.fffffeX+7e)
        code for .                         (+1.000000X+7f)
        code for .a                        (+1.001000X+7f)
        code for .b                        (+1.002000X+7f)
        ...
        code for .z                        (+1.01a000X+7f)

    double
        minimum nonmissing    -1.798e+308   (-1.fffffffffffffX+3ff)
        maximum nonmissing    +8.988e+307   (+1.fffffffffffffX+3fe)
        code for .                          (+1.0000000000000X+3ff)
        code for .a                         (+1.0010000000000X+3ff)
        code for .b                         (+1.0020000000000X+3ff)
        ...
        code for .z                         (+1.01a0000000000X+3ff)

Note that for float, all z>1.fffffeX+7e, and for double, all z>1.fffffffffffffX+3fe are considered to be missing values, and it is merely a subset of the values that are labeled ., .a, .b, ..., .z. For example, a value between .a and .b is still considered to be missing, and in particular, all the values between .a and .b are known jointly as .a_. Nevertheless, the recording of those values should be avoided.
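The ranges above translate into simple comparisons. The sketch below uses hypothetical function names and does not check the lower bound of each integer type; the double test uses a C99 hexadecimal floating constant for +1.fffffffffffffX+3fe (0x3fe is 1022 in decimal).

```c
#include <stdint.h>

/* Each function returns -1 for a nonmissing value, 0 for ".", and
   1..26 for ".a" through ".z". */
int byte_missing_code(int8_t v)  { return v > 100 ? v - 101 : -1; }
int int_missing_code(int16_t v)  { return v > 32740 ? v - 32741 : -1; }
int long_missing_code(int32_t v) { return v > 2147483620 ? (int)(v - 2147483621) : -1; }

/* A double is missing when it exceeds the largest nonmissing value,
   +1.fffffffffffffX+3fe.  Returns 1 if missing, 0 otherwise.  A NaN
   compares false and is reported as not missing by this sketch. */
int double_is_missing(double d)
{
    return d > 0x1.fffffffffffffp+1022;
}
```

Determining which of ., .a, ..., .z a missing double represents requires looking at the mantissa bits shown in the table of relevant numbers; this sketch only answers whether the value is missing.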

In the table above, we have used the {+|-}1.<digits>X{+|-}<digits> notation. The number to the left of the X is to be interpreted as a base-16 number (the period is thus the base-16 point) and the number to the right (also recorded in base 16) is to be interpreted as the power of 2 (sic). For example,

    1.01aX+3ff = (1.01a) * 2^(3ff)                         (base 16)
               = (1 + 0/16 + 1/16^2 + 10/16^3) * 2^1023    (base 10)

The {+|-}1.<digits>X{+|-}<digits> notation easily converts to IEEE 8-byte double: the 1 is the hidden bit, the digits to the right of the hexadecimal point are the mantissa bits, and the exponent is the IEEE exponent in signed (removal of offset) form. For instance, pi = 3.1415927... is

                                        8-byte IEEE, MSF
                                    -----------------------
    pi = +1.921fb54442d18X+001  =   40 09 21 fb 54 44 2d 18

                                =   18 2d 44 54 fb 21 09 40
                                    -----------------------
                                        8-byte IEEE, LSF

Converting {+|-}1.<digits>X{+|-}<digits> to IEEE 4-byte float is more difficult, but the same rule applies: the 1 is the hidden bit, the digits to the right of the hexadecimal point are the mantissa bits, and the exponent is the IEEE exponent in signed (removal of offset) form. What makes it more difficult is that the sign-and-exponent in the IEEE 4-byte format occupy 9 bits, which is not divisible by four, and so everything is shifted one bit. In float:

                                    4-byte IEEE, MSF
                                      -----------
    pi = +1.921fb60000000X+001  =     40 49 0f db

                                =     db 0f 49 40
                                      -----------
                                    4-byte IEEE, LSF

The easiest way to obtain the above result is to first convert +1.921fb60000000X+001 to an 8-byte double and then convert the 8-byte double to a 4-byte float.
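That two-step conversion can be sketched in C, where +1.921fb60000000X+001 corresponds to the C99 hexadecimal floating constant 0x1.921fb6p+1. The function name float_to_msf is ours, and the sketch assumes the host's float is an IEEE 4-byte float.

```c
#include <stdint.h>
#include <string.h>

/* Store f as a 4-byte IEEE float in MSF order, whatever the host's own
   byte order is.  Assumes float is IEEE 754 single precision. */
void float_to_msf(float f, unsigned char out[4])
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* reinterpret the float's bits */
    out[0] = (unsigned char)(bits >> 24);  /* sign and high exponent bits  */
    out[1] = (unsigned char)(bits >> 16);
    out[2] = (unsigned char)(bits >> 8);
    out[3] = (unsigned char)bits;          /* least significant byte       */
}
```

Starting from double d = 0x1.921fb6p+1 and calling float_to_msf((float)d, out) produces the bytes 40 49 0f db shown above; reversing them gives the LSF form.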

In any case, the relevant numbers are

    V     value                    MSF                 LSF
    ---------------------------------------------------------------
    m     -1.fffffffffffffX+3ff    ffefffffffffffff    ffffffffffffefff
    M     +1.fffffffffffffX+3fe    7fdfffffffffffff    ffffffffffffdf7f
    .     +1.0000000000000X+3ff    7fe0000000000000    000000000000e07f
    .a    +1.0010000000000X+3ff    7fe0010000000000    000000000001e07f
    .b    +1.0020000000000X+3ff    7fe0020000000000    000000000002e07f
    .z    +1.01a0000000000X+3ff    7fe01a0000000000    00000000001ae07f

    m     -1.fffffeX+7e            feffffff            fffffffe
    M     +1.fffffeX+7e            7effffff            ffffff7e
    .     +1.000000X+7f            7f000000            0000007f
    .a    +1.001000X+7f            7f000800            0008007f
    .b    +1.002000X+7f            7f001000            0010007f
    .z    +1.01a000X+7f            7f00d000            00d0007f
    ---------------------------------------------------------------

5. Dataset format definition

A Stata dataset containing two variables named myfloat and myint -- myfloat a Stata 4-byte float and myint a Stata 2-byte int -- and having one observation with myfloat = myint = 0, written to disk on 10 March 2017 at 2:23 p.m., would look like this:

    ---------------------------------------- top of file -----
    <stata_dta>

        <header>
            <release>118</release>
            <byteorder>MSF</byteorder>
            <K>0002</K>
            <N>0000000000000001</N>
            <label>000bSample Data</label>
            <timestamp>1110 Mar 2017 14:23</timestamp>
        </header>

        <map>
            0000000000000000
            0000000000000099
            0000000000000141
            0000000000000139
            0000000000000190
            00000000000001ab
            0000000000000220
            000000000000034e
            0000000000000371
            0000000000000384
            0000000000000393
            00000000000003b0
            00000000000003bc
        </map>

        <variable_types>
            fff7
            fff9
        </variable_types>

        <varnames>
            myfloat00........................
            myint00..........................
        </varnames>

        <sortlist>
            000000000000
        </sortlist>

        <formats>
            %9.0g00...............................
            %8.0g00...............................
        </formats>

        <value_label_names>
            00................................
            00................................
        </value_label_names>

        <variable_labels>
            00................................................
            00................................................
        </variable_labels>

        <characteristics>
        </characteristics>

        <data>
            000000000000
        </data>

        <strls>
        </strls>

        <value_labels>
        </value_labels>

    </stata_dta>
    ---------------------------------------- end of file -----

We have taken liberties in the spacing of the presentation. The file is actually run together, so it looks more like this,

    ---------------------------------------- top of file -----
    <stata_dta><header><release>118</release><byteorder>MSF</by
    teorder><K>0002</K><N>0000000000000001</N><label>000bSample
     Data</label><timestamp>1110 Mar 2017 14:23</timestamp></he
    ader><map>0000000000000000000000000000009900000000000001410
    000000000000139000000000000019000000000000001ab000000000000
    0220000000000000034e000000000000037100000000000003840000000
    00000039300000000000003b000000000000003bc</map><variable_ty
    pes>fff7fff9</variable_types><varnames>myfloat00...........
    .............myint00..........................</varnames><s
    ortlist>000000000000</sortlist><formats>%9.0g00...........
    ...................%8.0g00...............................</
    formats><value_label_names>00..............................
    ..00................................</value_label_names><va
    riable_labels>00...........................................
    ......00................................................</v
    ariable_labels><characteristics></characteristics><data>000
    000000000</data><strls></strls><value_labels></value_labels
    ></stata_dta>
    ---------------------------------------- end of file -----

We show binary content using hexadecimal values in italics. 00, for instance, means 1-byte binary zero. The 11 following <timestamp> means one byte recording hexadecimal 11, equivalent to decimal 17, and 17 is the length of the datestamp that follows it: "10 Mar 2017 14:23". We show bytes that may contain random values -- that are and should be ignored -- using a period. We have omitted some of the period bytes. For instance, we show only 32 of the 129 bytes allocated for variable names.

A 118-format .dta file begins with <stata_dta> and ends with </stata_dta>:

    <stata_dta>..........</stata_dta>
    /                               \
    start of file         end of file

Between those markers appear

    header               <header>..............</header>
    file map             <map>.................</map>
    variable types       <variable_types>......</variable_types>
    variable names       <varnames>............</varnames>
    sort order           <sortlist>............</sortlist>
    variable %fmts       <formats>.............</formats>
    value-label names    <value_label_names>...</value_label_names>
    variable labels      <variable_labels>.....</variable_labels>
    characteristics      <characteristics>.....</characteristics>
    data themselves      <data>................</data>
    strLs                <strls>...............</strls>
    value labels         <value_labels>........</value_labels>

Each marker pair must appear even if the content is empty, and the marker pairs must appear in the order shown.

5.1 Header

The Header is defined as

<header>...</header>

Between those markers appear

    file format id       <release>...</release>
    byteorder            <byteorder>...</byteorder>
    # of variables       <K>...</K>
    # of observations    <N>...</N>
    dataset label        <label>...</label>
    datetime stamp       <timestamp>...</timestamp>

Each marker pair must appear, and the pairs must appear in the order shown.

5.1.1 File format id

The file_format_id is recorded as

<release>118</release>

5.1.2 Byteorder

The byteorder is recorded as

<byteorder>byteorder</byteorder>

where byteorder is either MSF or LSF.

MSF stands for Most Significant byte First. In this encoding, the number 1 recorded as a 2-byte integer would be 00 followed by 01: 0001.

LSF stands for Least Significant byte First. In this encoding, the number 1 recorded as a 2-byte integer would be 01 followed by 00: 0100.

5.1.3 K, # of variables

K, the number of variables stored in the dataset, is recorded as

<K>bb</K>

where K = bb is a 2-byte unsigned integer field recorded according to byteorder.

5.1.4 N, # of observations

N, the number of observations stored in the dataset, is recorded as

<N>bbbbbbbb</N>

where N = bbbbbbbb is an 8-byte unsigned integer field recorded according to byteorder. In format 117, N was written in a 4-byte field.
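Fields such as K and N can be decoded without caring about the byte order of the machine doing the reading. The following is a minimal sketch; read_u16 and read_u64 are names chosen for illustration, and lsf is assumed to be nonzero when <byteorder> contained LSF.

```c
#include <stdint.h>

/* Decode a 2-byte unsigned integer, such as K, from the bytes at p. */
uint16_t read_u16(const unsigned char *p, int lsf)
{
    return lsf ? (uint16_t)(p[0] | (p[1] << 8))
               : (uint16_t)((p[0] << 8) | p[1]);
}

/* Decode an 8-byte unsigned integer, such as N, from the bytes at p. */
uint64_t read_u64(const unsigned char *p, int lsf)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[lsf ? 7 - i : i];   /* most significant byte first */
    return v;
}
```

Because the value is assembled arithmetically from individual bytes, the same code works on MSF and LSF machines and no separate byte swap is needed.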

5.1.5 Dataset label

The dataset label is recorded as

    <label>llccccc........c</label>
             |------------|
                ll bytes

Requirements:

ccc..c Up to 80 UTF-8 characters. UTF-8 characters each require 1 to 4 bytes. No trailing \0 is written.

ll The byte length of the UTF-8 characters, whose length is recorded in a 2-byte unsigned integer encoded according to byteorder.

Because ccc..c is allowed to contain up to 80 characters, 0 <= ll <= 4*80 (4*80 = 320 = 0x140).

If no characters are recorded (there is no data label), the .dta file contains

<label>0000</label>

where 0000 represents 2 bytes of 0.
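Reading the dataset label from a buffer that holds the header might look like the sketch below. The function parse_label and its calling convention are hypothetical; the only facts taken from the description above are the markers, the 2-byte length, and the 4*80 limit.

```c
#include <stddef.h>
#include <string.h>

/* Parse "<label>llccc...c</label>" starting at buf.  Copies the label
   into out, adding a \0 for C convenience, and returns its byte length.
   Returns -1 if the markers or the length are not as expected.
   lsf is nonzero for LSF files. */
int parse_label(const unsigned char *buf, size_t buflen, int lsf,
                char *out, size_t outsize)
{
    static const char otag[] = "<label>", ctag[] = "</label>";
    const size_t olen = sizeof otag - 1, clen = sizeof ctag - 1;
    size_t ll;

    if (buflen < olen + 2 || memcmp(buf, otag, olen) != 0)
        return -1;
    ll = lsf ? (size_t)(buf[olen] | (buf[olen + 1] << 8))
             : (size_t)((buf[olen] << 8) | buf[olen + 1]);
    if (ll > 4 * 80 || ll + 1 > outsize || buflen < olen + 2 + ll + clen)
        return -1;
    if (memcmp(buf + olen + 2 + ll, ctag, clen) != 0)
        return -1;
    memcpy(out, buf + olen + 2, ll);
    out[ll] = '\0';
    return (int)ll;
}
```

An output buffer of 321 bytes is always large enough. The sketch checks only the byte length; verifying that the label holds at most 80 characters would need a separate character count.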

5.1.6 Datetime stamp

The datetime stamp records the date and time the file was written. The datetime stamp is recorded as

    <timestamp>lccccc........c</timestamp>
                |------------|
                 l characters

No trailing \0 is written.

The length l of the datetime stamp is recorded as a 1-byte unsigned integer, followed by the characters of the date-time stamp.

l is required to be 0 or decimal 17. If l==0, then no datetime stamp is recorded. If l==(decimal) 17, the datetime stamp is recorded in the format

    ----+----1----+--
    dd Mon yyyy hh:mm      such as      10 Mar 2017 14:23

If dd<10 or hh<10, the element is recorded with a leading blank or a leading zero:

    04 Jul 2032 04:23
     4 Mar 2017  4:23
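A 17-byte datetime stamp can be broken into its parts with sscanf(), which accepts both the leading-blank and the leading-zero forms. This is a sketch under the assumption that month abbreviations are the English ones shown above; struct dta_stamp and parse_stamp are hypothetical names, and no range checking of the fields is attempted.

```c
#include <stdio.h>
#include <string.h>

struct dta_stamp { int day, month, year, hour, minute; };

/* Parse a 17-byte "dd Mon yyyy hh:mm" stamp, which is not \0-terminated
   in the file.  Returns 1 on success and 0 otherwise; month is 1..12. */
int parse_stamp(const unsigned char *s, size_t len, struct dta_stamp *t)
{
    static const char months[] = "JanFebMarAprMayJunJulAugSepOctNovDec";
    char tmp[18], mon[4];
    const char *hit;

    if (len != 17)
        return 0;
    memcpy(tmp, s, 17);
    tmp[17] = '\0';                       /* make a C string for sscanf */
    if (sscanf(tmp, "%2d %3s %4d %2d:%2d",
               &t->day, mon, &t->year, &t->hour, &t->minute) != 5)
        return 0;
    hit = strstr(months, mon);            /* locate the 3-letter month  */
    if (strlen(mon) != 3 || hit == NULL || (hit - months) % 3 != 0)
        return 0;
    t->month = (int)(hit - months) / 3 + 1;
    return 1;
}
```

A reader would call this only when l is 17; when l is 0 there is no stamp to parse.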

5.2 Map

The map has to do with the position in the file, not the Stata data themselves. The map is recorded as

<map>filepositions</map>

where filepositions is a list (array) of 14 8-byte offsets from the start of the file, written according to byteorder. The 14 positions to be recorded are

     #    file position of the start of the
    -----------------------------------------------
     1.   <stata_dta>, definitionally 0
     2.   <map>
     3.   <variable_types>
     4.   <varnames>
     5.   <sortlist>
     6.   <formats>
     7.   <value_label_names>
     8.   <variable_labels>
     9.   <characteristics>
    10.   <data>
    11.   <strls>
    12.   <value_labels>
    13.   </stata_dta>
    14.   end-of-file
    -----------------------------------------------

Notes:

1. File positions are values that can be obtained from and set by C function lseek(). File positions are obtained by lseek(fd, 0, SEEK_CUR) just before writing the marker listed above or, in the case of end-of-file, just after writing </stata_dta>.

2. If you are writing a file, we recommend writing <map>...</map> with all file positions filled in with zero as you are proceeding sequentially and tracking the file positions in a structure such as

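    /* off_t is declared in <sys/types.h> */
    struct mapdef {
        off_t hdr;          /*  1. <stata_dta>         */
        off_t map;          /*  2. <map>               */
        off_t types;        /*  3. <variable_types>    */
        off_t varnames;     /*  4. <varnames>          */
        off_t srtlist;      /*  5. <sortlist>          */
        off_t fmts;         /*  6. <formats>           */
        off_t vlblnames;    /*  7. <value_label_names> */
        off_t varlabs;      /*  8. <variable_labels>   */
        off_t chars;        /*  9. <characteristics>   */
        off_t data;         /* 10. <data>              */
        off_t strls;        /* 11. <strls>             */
        off_t vallabs;      /* 12. <value_labels>      */
        off_t tlr;          /* 13. </stata_dta>        */
        off_t bot;          /* 14. end-of-file         */
    };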

Record file positions in the structure just before writing the corresponding marker. Once you have written </stata_dta>, seek to map+5 and rewrite the file positions. Then close() the file.

3. Note that file positions are 8 bytes long, as they would be on a 64-bit computer. If you are on a 32-bit computer, you must set the most-significant 4 bytes to 0 and record your 32-bit file positions in the least-significant 4 bytes. If you are on a MSF computer, you write each file position by first writing 4 bytes of 0 and then the 4-byte file position. If you are on a LSF computer, you write each file position by writing the 4-byte file position and then writing 4 bytes of 0.
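Note 3 can be handled uniformly by building each 8-byte map entry from a 64-bit value. The sketch below uses the hypothetical name put_u64; a 32-bit file position is simply widened to 64 bits before the call, which leaves the most-significant 4 bytes zero.

```c
#include <stdint.h>

/* Write a file position into an 8-byte map slot in the file's
   byteorder.  lsf is nonzero for LSF. */
void put_u64(unsigned char out[8], uint64_t pos, int lsf)
{
    for (int i = 0; i < 8; i++) {
        /* the i-th least significant byte of pos */
        unsigned char byte = (unsigned char)(pos >> (8 * i));
        out[lsf ? i : 7 - i] = byte;
    }
}
```

After the closing marker has been written, the writer would seek to map+5 and emit the 14 slots produced this way, one for each member of struct mapdef in the order shown.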

5.3 Variable types

Variable types are recorded as

<variable_types>typlist</variable_types>

where typlist is a sequence (array) of K 2-byte unsigned integer fields written according to byteorder and recording the variable type of variable 1, 2, ..., K.

The types are encoded

      typ    Stata meaning    Description
    ----------------------------------------------------------
        1    str1             1 byte strf
        2    str2             2 or fewer bytes strf
      ...    etc.
     2045    str2045          2,045 or fewer bytes strf

    32768    strL             strL of any length

    65526    double           8-byte float
    65527    float            4-byte float
    65528    long             4-byte signed integer
    65529    int              2-byte signed integer
    65530    byte             1-byte signed integer
    ----------------------------------------------------------
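A reader typically turns typlist into byte widths so that it can step through <data>. The sketch below is one way to do that; the function names are ours, and the 8-byte width used for strL reflects the (v,o) field described in section 5.11 rather than anything in the table above.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes occupied in <data> by one value of the given type code, or 0 if
   the code does not appear in the table. */
size_t dta_type_width(uint16_t typ)
{
    if (typ >= 1 && typ <= 2045)
        return typ;             /* str1..str2045: fixed field of typ bytes */
    switch (typ) {
    case 32768: return 8;       /* strL: 8-byte (v,o) field */
    case 65526: return 8;       /* double */
    case 65527: return 4;       /* float  */
    case 65528: return 4;       /* long   */
    case 65529: return 2;       /* int    */
    case 65530: return 1;       /* byte   */
    default:    return 0;       /* unknown type code */
    }
}

/* Width in bytes of one observation: the sum over its k variables. */
size_t dta_obs_width(const uint16_t *typlist, size_t k)
{
    size_t total = 0;
    for (size_t i = 0; i < k; i++)
        total += dta_type_width(typlist[i]);
    return total;
}
```

For a dataset with a float, a byte, a double, and a str6, the observation width is 4 + 1 + 8 + 6 = 19 bytes.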

5.4 Variable names

Variable names are recorded as

<varnames>varnamelist</varnames>

where varnamelist is a sequence (array) of K 129-byte, \0 terminated, UTF-8 variable names. Each name may be 1 to 32 UTF-8 characters. Hence, the 129-byte length; 129 = 32*4+1.

For instance, if K==4, varnamelist might be

    0        129       258         387
    |        |         |           |
    vbl1\0...myvar\0...thisvar\0...lstvar\0...

The above states that variable 1 is named vbl1, variable 2 myvar, variable 3 thisvar, and variable 4 lstvar. The byte positions indicated by periods will contain random values (and note that we have omitted some of the periods).

If varnamelist is read into the C-array char varnamelist[], then &varnamelist[(i-1)*129] points to the name of the ith variable, 1 <= i <= K.

5.5 Sort order of observations

The sort order in which the observations will be subsequently recorded is recorded as

<sortlist>sortlist</sortlist>

where sortlist is a sequence (array) of K+1 unsigned 2-byte integers recorded according to byteorder.

sortlist specifies the sort-order of the dataset and is terminated by a 2-byte zero (0000 in hex). Each 2-byte element contains either a variable number or zero. The zero marks the end of the sortlist, and the recorded positions after that contain random junk. For instance, if the data are not sorted, the first 2-byte integer will contain a zero, and the 2-byte integers thereafter will contain junk. If K==4, the record will appear as

0000................

If the dataset is sorted by one variable, say myvar, and if that variable is the second variable in the varnamelist, the record will appear as

    00020000............    (if byteorder==MSF)
    02000000............    (if byteorder==LSF)

If the dataset is sorted by myvar and within myvar by vbl1, and if vbl1 is the first variable in the dataset, the record will appear as

    000200010000........    (if byteorder==MSF)
    020001000000........    (if byteorder==LSF)

If sortlist were read into the C-array short int srtlist[], then srtlist[0] would be the variable number of the first sort variable or, if the data were not sorted, 0. If the number is not 0, srtlist[1] would be the variable number of the second sort variable or, if there is not a second sort variable, 0, and so on.
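When the file's byteorder may differ from the machine's, the sortlist can be decoded byte by byte instead of being read directly into a short int array. The following sketch assumes the raw K+1 fields have been read into memory; parse_sortlist is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the sortlist.  raw holds K+1 2-byte fields in the file's
   byteorder.  The sort variables' numbers are stored in out, which must
   have room for k entries, and their count is returned.  Everything
   after the terminating zero is junk and is ignored. */
size_t parse_sortlist(const unsigned char *raw, size_t k, int lsf, uint16_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < k + 1; i++) {
        const unsigned char *p = raw + 2 * i;
        uint16_t v = lsf ? (uint16_t)(p[0] | (p[1] << 8))
                         : (uint16_t)((p[0] << 8) | p[1]);
        if (v == 0)
            break;              /* end of the sortlist */
        if (n < k)
            out[n++] = v;
    }
    return n;
}
```

A return value of 0 means the data are not sorted.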

5.6 Display formats

The display formats associated with each variable are recorded as

<formats>fmtlist</formats>

where fmtlist is a sequence (array) of K 57-byte, \0 terminated, UTF-8 display formats for each variable in the data. Display formats are an exception to the rule that the maximum number of characters is (bytelen-1)/4. This is because some characters that appear in a display format, such as "%", numbers, ".", etc., must be 1-byte long in UTF-8 (ASCII). As a result,

THERE IS NO SEPARATE MAXIMUM CHARACTER LENGTH THAT NEEDS TO BE CHECKED.

It is sufficient to check only that the byte length of the format is less than or equal to 57.

Here is an example of fmtlist:

    0        57       114       171    228                           285
    |        |        |         |      |                             |
    %9.0f\0..%8.2f\0..%20.0g\0..%td\0..%tcDDmonCCYY_HH:MM:SS.sss\0...

fmtlist specifies that variable 1 has a %9.0f format, variable 2 a %8.2f format, variable 3 a %20.0g format, and so on. Note that these are Stata formats, not C formats.

1. Formats beginning with %t or %-t are Stata's date and time formats.

2. Stata has an old %d format notation, and some datasets still have them. Format %d... is equivalent to modern format %td... and %-d... is equivalent to %-td...

3. Nondate formats ending in gc or fc are similar to C's g and f formats, but with commas. Most routines translated out of Stata would ignore the ending c (change it to \0).

4. Formats may contain commas rather than periods, such as %9,2f, indicating European format.

If fmtlist is read into the C-array char fmtlist[], then &fmtlist[57*(i-1)] refers to the starting address of the format for the ith variable.

5.7 Value-label names

The value-label names associated with each variable are recorded as

<value_label_names>lbllist</value_label_names>

where lbllist is a sequence (array) of K 129-byte, \0-terminated, UTF-8 label names. Each name may be up to 32 characters in length.

lbllist contains the names of the value labels associated with the variables 1, ..., K. For instance,

    0    129       258  387
    |    |         |    |
    \0...yesno\0...\0...yesno\0...

indicates that variables 1 and 3 have no value label associated with them, whereas variables 2 and 4 are both associated with the value label named yesno.

If lbllist is read into the C-array char lbllist[], then &lbllist[129*(i-1)] points to the start of the label name associated with the ith variable.

5.8 Variable labels

The variable labels associated with each variable are recorded as

<variable_labels>varlbllist</variable_labels>

where varlbllist is a sequence (array) of K 321-byte, \0 terminated, variable-label strings. If a variable has no label, the first byte of its label is \0.

5.9 Characteristics

Characteristics are used to record information that is unique to Stata and has no equivalent in other data management packages. When writing data, we recommend you write

<characteristics></characteristics>

That leaves the problem of reading a dataset that might contain characteristics. Characteristics are recorded as

<characteristics>...</characteristics>

We recommend you skip over the ... part. Do not, however, merely scan ahead until you encounter the close marker, because the ... part itself might contain a characteristic containing the string "</characteristics>".

The ... part contains zero or more individual characteristics, each appearing as

        4 bytes
        |--|
    <ch>llll...............</ch>
            |-------------|
              llll bytes

where llll is the length of what follows, recorded as a 4-byte unsigned integer field to be interpreted according to byteorder. Thus to skip an individual characteristic after reading <ch>, read 4 bytes (llll) and then skip llll bytes in the file. Then verify that you next read </ch>. The marker after that will then be either </characteristics>, meaning you are done, or <ch>, meaning you have yet another individual characteristic to skip.
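The skipping procedure can be sketched on an in-memory buffer as follows. The names eat and skip_characteristics are hypothetical, and a program reading from a file would replace the position arithmetic with reads and seeks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* If buf at *pos starts with marker, advance *pos past it and return 1. */
static int eat(const unsigned char *buf, size_t len, size_t *pos, const char *marker)
{
    size_t m = strlen(marker);
    if (len - *pos < m || memcmp(buf + *pos, marker, m) != 0)
        return 0;
    *pos += m;
    return 1;
}

/* Skip <characteristics>...</characteristics> starting at *pos.
   Returns the number of characteristics skipped and leaves *pos just
   past the close marker, or returns -1 if the section is malformed. */
int skip_characteristics(const unsigned char *buf, size_t len, size_t *pos, int lsf)
{
    int count = 0;
    if (*pos > len || !eat(buf, len, pos, "<characteristics>"))
        return -1;
    while (eat(buf, len, pos, "<ch>")) {
        const unsigned char *p = buf + *pos;
        uint32_t llll;
        if (len - *pos < 4)
            return -1;
        llll = lsf ? ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
                      (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24)
                   : ((uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
                      (uint32_t)p[2] << 8 | (uint32_t)p[3]);
        *pos += 4;
        if (len - *pos < llll)
            return -1;
        *pos += llll;               /* jump over the contents unread */
        if (!eat(buf, len, pos, "</ch>"))
            return -1;
        count++;
    }
    return eat(buf, len, pos, "</characteristics>") ? count : -1;
}
```

Because the contents are jumped over by length rather than scanned, a characteristic whose contents happen to include the text of the close marker does not confuse the reader.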

For those who want to read and write characteristics, the ... part contains the information on the individual characteristic being defined, recorded as

    0             129              258    l-1
    |             |                |        |
    varname\0.....charname\0.......contents\0
    |---------------------------------------|
                   llll bytes

Bytes 0-128 contain a \0 terminated, UTF-8 encoded variable name, bytes 129-257 contain a \0 terminated, UTF-8 encoded characteristic name, and bytes 258 through the end of the record contain the binary-zero terminated UTF-8 contents of characteristic varname[charname].

The maximum allowed byte length of the contents, including the binary 0, is 67,784.

5.10 Data

The data are recorded as

<data>data</data>

where data is observation 1 followed by observation 2 followed by ... followed by observation N,

<data>obs1obs2obs3...obsN</data>

and where each observation is variable 1's value followed by variable 2's value ... followed by variable K's value,

    <data>v11v12...v1Kv21v22...v2K......vN1vN2...vNK</data>
          |----------||----------| .... |----------|
          |  obs. 1  ||  obs. 2  | .... |  obs. N  |

Each vIJ is recorded in its own internal format, as given by typlist and defined in sections 3 (strfs) and 4 (numeric values). We have not yet explained how strLs are written; we will do that in section 5.11. In the meantime, let us imagine a dataset without strLs.

All values are written per byteorder. Strfs are binary-zero terminated if they are shorter than the allowed space, but they are not terminated if they are full width.

For instance, consider a dataset in which V1 is a float, V2 a byte, V3 a double, and V4 a str6:

. describe

    Contains data
      obs:             2
     vars:             4
     size:            38
    ----------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------
    V1              float   %9.0g
    V2              byte    %8.0g
    V3              double  %10.0g
    V4              str6    %9s
    ----------------------------------------------------------------
    Sorted by:

. list

         +-----------------------+
         | V1   V2   V3       V4 |
         |-----------------------|
      1. |  0    1    2    first |
      2. |  1    2    3   second |
         +-----------------------+

The corresponding <data>...</data> would read (assuming MSF byteorder),

    <data>00000000014000000000000000first003f800000024008000000000000
    second</data>

Values for variables and observations are run together, but we can more easily understand it if we add nonsignificant white space

    <data>
        00000000 01 4000000000000000 first00
        3f800000 02 4008000000000000 second
    </data>

1. Each variable's value is written according to its variable type.

V1's value is 4 bytes (8 hexadecimal digits) long because V1 is of type float. What is written is interpreted as a 4-byte IEEE float.

V2's value is 1 byte (2 hexadecimal digits) long because V2 is of type byte. What is written is interpreted as 1-byte signed integer.

V3's value is 8 bytes (16 hexadecimal digits) long because V3 is of type double. What is written is interpreted as 8-byte IEEE float.

2. Look carefully at V4, a str6 taking on values "first" and "second". The value "first" is written as first\0 -- with trailing binary 0. The value "second" is written without a trailing binary 0 because "second" is 6 bytes long, which is to say, full length. If another observation contained "dog", it would be written dog\0.. -- a binary 0 would be written, and then two random bytes written so that the length of what was written would be the required 6.

The general rule is that str# is written in a field of # bytes. If the value is # bytes long, no binary 0 is suffixed. If the value is less than # bytes long, a binary 0 is suffixed at the end of the string.

An empty string is always written as \0 and then padded with random bytes if necessary to fill out the required length.

5.11 StrLs

StrLs are long strings. In the above section on <data>...</data>, we saw that the value of each strf -- Stata types str1, str2, ..., str2045 -- is recorded as a fixed-length string.

StrLs can be up to 2 billion bytes long, so they are recorded differently.

If there are no strL variables in the data, <strls>...</strls> is recorded as

<strls></strls>

In section 5.10 we had an example showing the contents of <data>...</data> for a dataset containing four variables and two observations. There were no strLs in that example, and thus the entire <data>...</data> and <strls>...</strls> would read

    <data>00000000014000000000000000first003f800000024008000000000000
    second</data><strls></strls>

or, with more readable, nonsignificant spaces,

    <data>
        00000000 01 4000000000000000 first00
        3f800000 02 4008000000000000 second
    </data>
    <strls>
    </strls>

Let's take that example's dataset and add a strL variable to it as variable V5:

. describe

    Contains data
      obs:             2
     vars:             5
     size:            38
    ----------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------
    V1              float   %9.0g
    V2              byte    %8.0g
    V3              double  %10.0g
    V4              str6    %9s
    V5              strL    %9s
    ----------------------------------------------------------------
    Sorted by:

. list

         +--------------------------------+
         | V1   V2   V3       V4       V5 |
         |--------------------------------|
      1. |  0    1    2    first    third |
      2. |  1    2    3   second   fourth |
         +--------------------------------+

The data for the strL variable are divided between <data>...</data> and <strls>...</strls>. Run together in the .dta file, it looks like this,

    <data>00000000014000000000000000first000005000000000001
    3f800000024008000000000000second0005000000000002</data>
    <strls>GSO0000000500000000000000018200000006third00
    GSO0000000500000000000000028200000007fourth00</strls>

or, with more readable, nonsignificant spaces,

    <data>
    00000000 01 4000000000000000 first00 0005000000000001
    3f800000 02 4008000000000000 second 0005000000000002
    </data>
    <strls>
    GSO 00000005 0000000000000001 82 00000006 third00
    GSO 00000005 0000000000000002 82 00000007 fourth00
    </strls>

StrLs are recorded in two parts:

1. In the more readable display of <data>...</data>, we've put each observation on a separate line, and we've put nonsignificant blanks between variables. Here's the <data>...</data> part again:

    <data>
    00000000 01 4000000000000000 first00 0005000000000001
    3f800000 02 4008000000000000 second 0005000000000002
    </data>

The strL variable is the last one -- the one in red -- but that's not why it's in red. Red means a change from the previous .dta format. Anyway, the two strL values are recorded as 0005000000000001 and 0005000000000002.

0005000000000001 and 0005000000000002 each represent an 8-byte field (16 hexadecimal digits). Interpret that 8-byte field as a 2-byte integer followed by a 6-byte integer:

    0005000000000001 = 0005 000000000001 = 5, 1

    0005000000000002 = 0005 000000000002 = 5, 2

The two values in each observation are called (v,o) values. v and o stand for "variable" and "observation". They indicate that the strL for variable 5, observation 1, is found in the <strls>...</strls> table under variable 5, observation 1, and that the strL for variable 5, observation 2, is found in the <strls>...</strls> table under variable 5, observation 2.

Well, where else would they be? The fact is that if two strLs are equal, across observations or even across variables or across variables and observations, then the (v,o) values can differ from the variable and observation being recorded. They can cross-reference other variables and observations, and that saves memory. Usually, however, (v,o) equals the variable and observation being recorded in <data>...</data>.

Before moving on to the explanation of the <strls>...</strls> table, we will talk a little about this 2-byte plus 6-byte encoding of (v,o) in <data>...</data>.

The use of a 6-byte integer is awfully odd. In the previous .dta format, the (v,o) values were written as two 4-byte values. Since then, Stata has learned to deal with more observations, and o can no longer be stored in just 4 bytes. Stata allows up to 281 trillion observations, and that means a bigger integer is required to store o. An 8-byte integer would have been the natural choice. But, for our own reasons, we needed the whole (v,o) field to still be 8 bytes in length. So we split it into 2 bytes plus 6 bytes, and that's adequate for our purposes: a 6-byte unsigned integer holds values up to 2^48 - 1, which is about 281 trillion. See 5.11.4 and 5.11.6 for C code for writing and reading 6-byte integers.

2. <strls>...</strls> records the mapping of (v,o) values to corresponding strings. In the case of strLs, strings are known as Generic String Objects (GSOs). Let's repeat the readable form of <strls>...</strls> from our example:

    <strls>
    GSO 00000005 0000000000000001 82 00000006 third00
    GSO 00000005 0000000000000002 82 00000007 fourth00
    </strls>

In this example, two GSOs are defined. The first is the GSO for (v,o)=(5,1) and the second, the GSO for (5,2). This time, there is no 2-byte, 6-byte silliness. v is recorded as a 4-byte integer, and o is recorded as an 8-byte integer.

(v,o)=(5,1) corresponds to "third".

(v,o)=(5,2) corresponds to "fourth".

Obviously, there is more information recorded in the GSO than just the (v,o) value and its corresponding string, and we will get to that, but let's focus first on the (v,o) values.

5.11.1 (v,o) values

If our dataset contained variable V5 equaling "third" in both observations,

    . list

         +--------------------------------+
         | V1   V2   V3       V4       V5 |
         |--------------------------------|
      1. |  0    1    2    first    third |
      2. |  1    2    3   second    third |
         +--------------------------------+

they could be recorded as two separate strLs,

    <data>
    00000000 01 4000000000000000 first00 0005000000000001
    3f800000 02 4008000000000000 second 0005000000000002
    </data>
    <strls>
    GSO 00000005 0000000000000001 82 00000006 third00
    GSO 00000005 0000000000000002 82 00000006 third00
    </strls>

or like this:

    <data>
    00000000 01 4000000000000000 first00 0005000000000001
    3f800000 02 4008000000000000 second 0005000000000001
    </data>
    <strls>
    GSO 00000005 0000000000000001 82 00000006 third00
    </strls>

Note that there is only one GSO in the second form, and both observations refer to it by specifying (v,o) as (5,1) in <data>...</data>. This is called a shared or cross-linked GSO. Many observations can link to the same GSO.

By the way, the data could not be recorded like this:

    <data>
    00000000 01 4000000000000000 first00 0005000000000002
    3f800000 02 4008000000000000 second 0005000000000002
    </data>
    <strls>
    GSO 00000005 0000000000000002 82 00000006 third00
    </strls>

The strL must be defined the first time it occurs as you read the <data>...</data> table from left to right and then down. The string "third" first occurs in variable 5, observation 1. A strL is said to be defined in variable i, observation j, if its (v,o) == (i,j). After that, you can make backward references to the defined (v,o) values or define new ones. Forward references are not allowed.

(v,o) = (0,0) is a special allowed value that refers to a GSO containing an empty string ("") that is predefined for you (and that you must not redefine in the <strls>...</strls> table). If variable V5 in the first observation contained an empty string,

    . list

         +--------------------------------+
         | V1   V2   V3       V4       V5 |
         |--------------------------------|
      1. |  0    1    2    first          |
      2. |  1    2    3   second   fourth |
         +--------------------------------+

the data could be recorded as

    <data>
    00000000 01 4000000000000000 first00 0005000000000001
    3f800000 02 4008000000000000 second 0005000000000002
    </data>
    <strls>
    GSO 00000005 0000000000000001 82 00000001 00
    GSO 00000005 0000000000000002 82 00000007 fourth00
    </strls>

but that is considered bad style because it causes Stata to waste a little memory. The right way to record the data is

    <data>
    00000000 01 4000000000000000 first00 0000000000000000
    3f800000 02 4008000000000000 second 0005000000000002
    </data>
    <strls>
    GSO 00000005 0000000000000002 82 00000007 fourth00
    </strls>

In the above, (v,o) = (0,0) in the first observation. By the way, if both observations of variable V5 contained an empty string, the data would be recorded as

    <data>
    00000000 01 4000000000000000 first00 0000000000000000
    3f800000 02 4008000000000000 second 0000000000000000
    </data>
    <strls>
    </strls>

The rules for specifying (v,o) values are the following:

1. In <data>...</data>, strLs are recorded as (v,o) values. That means a (v,o) value is specified for each strL variable in each observation.

2. (v,o) values are recorded in an 8-byte field and are interpreted as a 2-byte unsigned integer followed by a 6-byte unsigned integer.

3. For variable i, observation j, (v,o) = (0,0) if i,j contains an empty string ("").

4. For variable i, observation j, if (v,o) != (0,0), then o<j or, if o==j, v<=i. That is, variable i, observation j either links to its own (v,o) = (i,j) or links to the (v,o) value of a variable and observation that appeared before it in <data>...</data>.

5. The usual case is (v,o) = (i,j).

6. Programs that write .dta files are not required to produce crosslinked (v,o) values when contents of strings are equal.

7. Programs that read .dta files are required to be able to process crosslinked (v,o) values.
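Rules 3 and 4 above can be expressed as a small check that a reader might apply to each (v,o) value as it scans <data>...</data>. The following is a sketch only. The function name vo_is_allowed is hypothetical, and the rejection of pairs in which exactly one of v and o is zero is our assumption, based on variables and observations being numbered from 1.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if (v,o) is an allowed reference for the strL stored at
   variable i, observation j (both numbered from 1), and 0 otherwise. */
int vo_is_allowed(uint16_t v, uint64_t o, uint16_t i, uint64_t j)
{
    assert(i >= 1 && j >= 1);

    if (v == 0 && o == 0) return 1;   /* predefined empty string      */
    if (v == 0 || o == 0) return 0;   /* assumption: names no strL    */
    if (o < j) return 1;              /* earlier observation          */
    return o == j && v <= i;          /* same observation, not later  */
}
```

The check covers only the position of the reference. Whether the referenced (v,o) is actually defined by a GSO can be verified only after <strls>...</strls> has been read.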

5.11.2 GSOs

The markers <strls>...</strls> contain the definitions of zero or more GSOs:

<strls>GSOdefGSOdef...GSOdef</strls>

Each GSO can contain either a UTF-8 or a binary string. You specify the string in the GSO. Use the following definition: string must be binary if it contains a binary 0 that is not used as an extra terminator.

The format of a GSO record is

              o     len      contents
              |      |          |
           -------- ----        |
    GSOvvvvooooooootllllxxxxxxxxxxxxxxx...x
       ----        -    [--- len bytes ---]
        |          |
        v          type

    name       length   contents
    -----------------------------------------------------------
                  3     GSO (fixed string)
    v             4     unsigned 4-byte integer, v of (v,o)
    o             8     unsigned 8-byte integer, o of (v,o)
    t             1     unsigned 1-byte integer
    len           4     unsigned 4-byte integer
    contents    len     contents of strL
    -----------------------------------------------------------
    v, o, and len are recorded per byteorder.

    t is encoded:
        129 (decimal, hex 81)   binary
        130 (decimal, hex 82)   ascii

    If t==129, contents contains the string AS-IS. len contains the length of contents.

    If t==130, contents must contain trailing \0. len contains the length of the string including \0. If using C, len = strlen(string) + 1.

Notes:

1. v and o are the (v,o) values defined in <data>...</data>. v and o must follow the rules of specification previously given.

2. Variable v must be of type strL.

3. GSOs must appear in "ascending" order of (v,o). Ascending order is defined as the same order as they appeared in <data>...</data>: ascending v for o==1, followed by ascending v for o==2, ....

4. All (v,o) values that appeared in <data>...</data> must be defined except (v,o) = (0,0). Each may be defined only once.

5. (v,o) = (0,0) may not be defined.
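To make the GSO record layout concrete, here is a sketch that serializes one GSO of type 130 into a byte buffer. The names put_uint and gso_write_text are hypothetical, and the msf flag (nonzero for byteorder MSF, zero for LSF) is our own convention for this example rather than part of the format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store the low nbytes of x at dst, most significant byte first if msf. */
static void put_uint(unsigned char *dst, uint64_t x, int nbytes, int msf)
{
    for (int k = 0; k < nbytes; k++) {
        int shift = 8 * (msf ? nbytes - 1 - k : k);
        dst[k] = (unsigned char)(x >> shift);
    }
}

/* Serialize a GSO with t==130 (text with trailing \0) into buf.
   Returns the number of bytes written, or 0 if it does not fit. */
size_t gso_write_text(unsigned char *buf, size_t bufsize,
                      uint32_t v, uint64_t o, const char *text, int msf)
{
    size_t len   = strlen(text) + 1;          /* len counts the \0   */
    size_t total = 3 + 4 + 8 + 1 + 4 + len;

    assert(v != 0 || o != 0);                 /* (0,0) is predefined */
    if (total > bufsize || len > UINT32_MAX) return 0;

    memcpy(buf, "GSO", 3);
    put_uint(buf + 3, v, 4, msf);             /* v: 4 bytes          */
    put_uint(buf + 7, o, 8, msf);             /* o: 8 bytes          */
    buf[15] = 130;                            /* t                   */
    put_uint(buf + 16, len, 4, msf);          /* len: 4 bytes        */
    memcpy(buf + 20, text, len);              /* contents with \0    */
    return total;
}
```

Called with v=5, o=1, text "third", and msf=1, the sketch should produce the 26 bytes shown earlier as GSO 00000005 0000000000000001 82 00000006 third00.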

5.11.3 Advice on writing strLs

Writing .dta datasets containing strLs is easy if you do not attempt to link equal strLs. Sometimes, crosslinking is easy, too, depending on how your original data are stored.

Here is pseudocode for writing strLs without crosslinking:

    write "<data>"
    for (j=1; j<=N; j++) {
        for (i=1; i<=K; i++) {
            if (variable i is strL) {
                if (contents of i != "") {
                    write i as 2 bytes (see 5.11.4)
                    write j as 6 bytes (see 5.11.4)
                }
                else {
                    write 0 as 8 bytes
                }
            }
            else ...   /* the usual */
        }
    }
    write "</data>"

    write "<strls>"
    for (j=1; j<=N; j++) {
        for (i=1; i<=K; i++) {
            if (variable i is strL) {
                if (contents of i != "") {
                    write GSO for (v,o) = (i,j)
                }
            }
        }
    }
    write "</strls>"

5.11.4 Advice on writing 6-byte integers

    #include <string.h>       /* memcpy() */

    #define Int8 long long    /* or whatever is the type of an 8-byte integer */

    #define MSF_Start_of_Int6_in_Int8(p)  (((char *) (p))+2)
    #define LSF_Start_of_Int6_in_Int8(p)  ((char *) (p))

    /* Both routines copy bytes in the machine's own byte order, so use
       the one that matches both the computer and the file's byteorder. */

    /* code for byteorder MSF */
    void MSF_insert_vo_byteorder(char s[/* 8 */], int v, Int8 o)
    {
        unsigned short int  i2 ;

        i2 = v ;
        memcpy(s, &i2, 2) ;                              /* v: 2 bytes */
        memcpy(s+2, MSF_Start_of_Int6_in_Int8(&o), 6) ;  /* o: low 6 bytes */
    }

    /* code for byteorder LSF */
    void LSF_insert_vo_byteorder(char s[/* 8 */], int v, Int8 o)
    {
        unsigned short int  i2 ;

        i2 = v ;
        memcpy(s, &i2, 2) ;                              /* v: 2 bytes */
        memcpy(s+2, LSF_Start_of_Int6_in_Int8(&o), 6) ;  /* o: low 6 bytes */
    }
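The routines above copy bytes in the byte order of the machine running them. As an alternative (our own sketch, not the approach used above), the 8-byte field can be built with shifts so that the same code produces either byteorder on any machine. The name vo_pack and the msf flag are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack (v,o) into the 8-byte field used in <data>...</data>.
   msf != 0 selects byteorder MSF; msf == 0 selects LSF. */
void vo_pack(unsigned char s[8], uint16_t v, uint64_t o, int msf)
{
    assert(o < (UINT64_C(1) << 48));      /* o must fit in 6 bytes */

    for (int k = 0; k < 2; k++)           /* v: 2 bytes            */
        s[k] = (unsigned char)(v >> (8 * (msf ? 1 - k : k)));

    for (int k = 0; k < 6; k++)           /* o: 6 bytes            */
        s[2 + k] = (unsigned char)(o >> (8 * (msf ? 5 - k : k)));
}
```

With msf=1, packing (5,1) should give the bytes 00 05 00 00 00 00 00 01, matching the example in section 5.11.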

5.11.5 Advice on reading strLs

Here is pseudocode for reading strLs (including crosslinking):

    read "<data>"
    for (j=1; j<=N; j++) {
        for (i=1; i<=K; i++) {
            read data the usual way
            in the case of strLs, just store (v,o) values
        }
    }
    read "</data>"

    read "<strls>"
    for (j=1; j<=N; j++) {
        for (i=1; i<=K; i++) {
            if (variable i is strL) {
                get v and o from data(i,j)
                if (v==i && o==j) {
                    read GSO up to contents
                    read len bytes of contents
                    store contents in new_dataset(i,j)
                }
                else {
                    if (v==0 && o==0) {
                        store "" in new_dataset(i,j)
                    }
                    else if (o<j || (o==j && v<=i)) {
                        retrieve string new_dataset(v,o) ...
                        ... that you previously stored.
                        store string in new_dataset(i,j)
                    }
                    else {
                        abort with error due to ...
                        ... forward reference
                    }
                }
            }
        }
    }
    read "</strls>"
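A different way to resolve crosslinks, offered here only as a sketch, is to read all GSOs into an index first and then look each (v,o) up by binary search. This works because GSOs appear in ascending order of o and, within o, ascending v. The struct gso_entry and the function gso_lookup are hypothetical names, and holding contents as a C string is a simplification that suits only GSOs with t==130.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory index entry for one GSO. */
struct gso_entry {
    uint32_t    v;
    uint64_t    o;
    const char *contents;
};

/* Look up (v,o) in tab[0..n-1], which must be in file order
   (ascending o, then ascending v). Returns "" for (0,0) and
   NULL if (v,o) is not defined. */
const char *gso_lookup(const struct gso_entry *tab, size_t n,
                       uint32_t v, uint64_t o)
{
    size_t lo = 0, hi = n;

    assert(tab != NULL || n == 0);
    if (v == 0 && o == 0) return "";          /* predefined empty string */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct gso_entry *e = &tab[mid];

        if (e->o == o && e->v == v) return e->contents;
        if (e->o < o || (e->o == o && e->v < v)) lo = mid + 1;
        else hi = mid;
    }
    return NULL;
}
```

A NULL result for a nonzero (v,o) indicates a reference to a GSO that was never defined, which a reader would treat as an error.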

5.11.6 Advice on reading 6-byte integers

    #include <string.h>       /* memcpy() */

    /* The following #defines are the same as in 5.11.4 */

    #define Int8 long long    /* or whatever is the type of an 8-byte integer */

    #define MSF_Start_of_Int6_in_Int8(p)  (((char *) (p))+2)
    #define LSF_Start_of_Int6_in_Int8(p)  ((char *) (p))

    /* the following code is new */

    /* code for byteorder MSF */
    void MSF_extract_vo_byteorder(int *v_ptr, Int8 *o_ptr, char s[/* 8 */])
    {
        unsigned short int  i2 ;

        memcpy(&i2, s, 2) ;
        *v_ptr = i2 ;

        *o_ptr = 0 ;          /* clear the 2 bytes that are not copied */
        memcpy(MSF_Start_of_Int6_in_Int8(o_ptr), s+2, 6) ;
    }

    /* code for byteorder LSF */
    void LSF_extract_vo_byteorder(int *v_ptr, Int8 *o_ptr, char s[/* 8 */])
    {
        unsigned short int  i2 ;

        memcpy(&i2, s, 2) ;
        *v_ptr = i2 ;

        *o_ptr = 0 ;          /* clear the 2 bytes that are not copied */
        memcpy(LSF_Start_of_Int6_in_Int8(o_ptr), s+2, 6) ;
    }
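As with writing, the extraction can also be done with shifts so that one routine reads either byteorder regardless of the machine it runs on. This is our own sketch rather than the method shown above. The name vo_unpack and the msf flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack the 8-byte (v,o) field from <data>...</data>.
   msf != 0 means the file's byteorder is MSF; msf == 0 means LSF. */
void vo_unpack(const unsigned char s[8], uint16_t *v, uint64_t *o, int msf)
{
    uint64_t vv = 0, oo = 0;

    assert(v != NULL && o != NULL);

    for (int k = 0; k < 2; k++)           /* v: 2 bytes */
        vv |= (uint64_t)s[k] << (8 * (msf ? 1 - k : k));

    for (int k = 0; k < 6; k++)           /* o: 6 bytes */
        oo |= (uint64_t)s[2 + k] << (8 * (msf ? 5 - k : k));

    *v = (uint16_t)vv;
    *o = oo;
}
```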

5.12 Value labels

Numeric variables in Stata optionally have value labels associated with them. Value labels map numeric values to strings, such as 1 to "male" and 2 to "female". Mappings are named. The mapping of 1 to "male" and 2 to "female" might be named gender. The recording of the names of the mappings optionally associated with variables was discussed in section 5.7. Variable sex might be associated with value label gender.

Here we discuss the recording of the value label definition itself, such as gender. Even if value label gender is used by a variable, it is not required that the corresponding value-label definition be provided.

Value labels are defined by

<value_labels>individual_definitions</value_labels>

where each individual definition is given by

<lbl>def</lbl>

If no individual definitions are provided, the above becomes

<value_labels></value_labels>

If individual definitions are provided, the above becomes

<value_labels><lbl>def</lbl>...<lbl>def</lbl></value_labels>

where def is

    len   labelname                      padding
    |     |                              |  value_label_table
    |     |                              |  |
    llllcccccccccccccccccccccccccccccccccppp...................
        |---------- 129 bytes ----------|   |--- len bytes ---|

    def                  len   format   comment
    -------------------------------------------------------------------
    len                    4   int      length of value_label_table
    labelname            129   char     max 32 UTF-8 characters, \0 terminated
    padding                3
    value_label_table    len            see next table
    -------------------------------------------------------------------

    value_label_table    len   format      comment
    ----------------------------------------------------------
    n                      4   int         number of entries
    txtlen                 4   int         length of txt[]
    off[]                4*n   int array   txt[] offset table
    val[]                4*n   int array   sorted value table
    txt[]             txtlen   char        text table
    ----------------------------------------------------------

len, n, txtlen, off[], and val[] are encoded per byteorder. The maximum byte length of a single label within txt[] is 32,000, or 32,001 bytes, including the terminating binary 0. Stata ignores labels that exceed the limit.

For example, the value_label_table for 1<->yes and 2<->no, shown in MSF format, would be

    byte position:  00 01 02 03  04 05 06 07  08 09 10 11  12 13 14 15
    ---------------------------------------------------------------------
         contents:  00 00 00 02  00 00 00 07  00 00 00 00  00 00 00 04
          meaning:  n = 2        txtlen = 7   off[0] = 0   off[1] = 4
    ---------------------------------------------------------------------

    byte position:  16 17 18 19  20 21 22 23  24 25 26 27 28 29 30
    ---------------------------------------------------------------------
         contents:  00 00 00 01  00 00 00 02   y  e  s 00  n  o 00
          meaning:  val[0] = 1   val[1] = 2   txt --->
    ---------------------------------------------------------------------

The interpretation is that there are n=2 values being mapped. The values being mapped are val[0]=1 and val[1]=2. The corresponding text for val[0] would be at off[0]=0 of txt[] (and so be "yes") and for val[1] would be at off[1]=4 of txt[] (and so be "no").
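The interpretation just described can be sketched in C without relying on the machine's own integer layout. The names get_u32 and vl_text_for are hypothetical, the msf flag is our own convention, and the linear scan is chosen for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 4-byte integer at p; msf != 0 means most significant byte first. */
static uint32_t get_u32(const unsigned char *p, int msf)
{
    return msf
        ? ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
          ((uint32_t)p[2] << 8)  |  (uint32_t)p[3]
        : ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
          ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Return the label text mapped to value in a value_label_table of len
   bytes, or NULL if value is not mapped or the table is inconsistent.
   The sketch trusts that the text is \0 terminated inside txt[]. */
const char *vl_text_for(const unsigned char *tab, size_t len,
                        int32_t value, int msf)
{
    uint32_t n, txtlen;

    assert(tab != NULL);
    if (len < 8) return NULL;
    n      = get_u32(tab, msf);
    txtlen = get_u32(tab + 4, msf);
    if (n > (len - 8) / 8 || txtlen != len - 8 - 8 * (size_t)n) return NULL;

    const unsigned char *off = tab + 8;              /* off[] follows n, txtlen */
    const unsigned char *val = off + 4 * (size_t)n;  /* val[] follows off[]     */
    const unsigned char *txt = val + 4 * (size_t)n;  /* txt[] follows val[]     */

    for (uint32_t i = 0; i < n; i++) {
        if ((int32_t)get_u32(val + 4 * (size_t)i, msf) == value) {
            uint32_t o = get_u32(off + 4 * (size_t)i, msf);
            return o < txtlen ? (const char *)(txt + o) : NULL;
        }
    }
    return NULL;
}
```

Applied to the 31 example bytes above with msf=1, a lookup of 1 should return "yes" and a lookup of 2 should return "no".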

Interpreting this table in C is not as daunting as it appears. Let (char *) p refer to the memory area into which value_label_table is read. Assume your compiler uses 4-byte ints. The following manifests make interpreting the table easier:

    #define SZInt                   4
    #define Off_n                   0
    #define Off_nxtoff              SZInt
    #define Off_off                 (SZInt+SZInt)
    #define Off_val(n)              (SZInt+SZInt+(n)*SZInt)
    #define Off_txt(n)              (Off_val(n) + (n)*SZInt)
    #define Len_table(n,nxtoff)     (Off_txt(n) + (nxtoff))

    #define Ptr_n(p)        ( (int *)  ( ((char *) (p)) + Off_n ) )
    #define Ptr_nxtoff(p)   ( (int *)  ( ((char *) (p)) + Off_nxtoff ) )
    #define Ptr_off(p)      ( (int *)  ( ((char *) (p)) + Off_off ) )
    #define Ptr_val(p,n)    ( (int *)  ( ((char *) (p)) + Off_val(n) ) )
    #define Ptr_txt(p,n)    ( (char *) ( ((char *) (p)) + Off_txt(n) ) )

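It is now the case that, with n = *Ptr_n(p), for(i=0; i < n; i++), the value Ptr_val(p,n)[i] is mapped to the character string Ptr_txt(p,n) + Ptr_off(p)[i]. Note that the second argument of Ptr_val() and Ptr_txt() is the number of entries n, not the index i.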

Remember in allocating memory for *p that the table can be big. The limits are n=65,536 mapped values with each label being up to 32,001 bytes long (including the null terminating byte). There are n offsets and n numeric values in the table, each 4 bytes long. n itself is 4 bytes, and txtlen is 4 bytes. Such a table would be 2,097,741,832 bytes long ((65536 * (32001 + 4 + 4)) + 4 + 4). No user is likely to approach that limit, and in any case, after reading the first 8 bytes of the table (n and txtlen), you can calculate the remaining length as 2*4*n+txtlen and malloc() the exact amount.
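The size arithmetic above can be checked with a short helper. This is only an illustration. The name vl_table_bytes is hypothetical, and 64-bit arithmetic is used so that the largest table size does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes in a value_label_table with n entries and txtlen bytes
   of text: n and txtlen (4 bytes each), then off[] and val[]
   (4 bytes per entry each), then txt[]. */
uint64_t vl_table_bytes(uint64_t n, uint64_t txtlen)
{
    assert(n <= 65536);                   /* limit stated in the text */
    return 4 + 4 + 2 * 4 * n + txtlen;
}
```

For the two-entry yes/no example, n=2 and txtlen=7 give 31 bytes, which agrees with byte positions 00 through 30 in the table shown earlier.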

Constructing the table is more difficult. The easiest approach is to set arbitrary limits equal to or smaller than Stata's as to the maximum number of entries and total text length you will allow and simply declare the three pieces off[], val[], and txt[] according to those limits:

int off[MaxValueForN] ; int val[MaxValueForN] ; char txt[MaxValueForTxtlen] ;

Stata's internal code follows a more complicated strategy of always keeping the table in compressed form and having a routine that will "add one position" in the table. This is slower but keeps memory requirements to be no more than the actual size of the table.

In any case, when adding new entries to the table, remember that val[] must be in ascending order: val[0] < val[1] < ... < val[n-1].
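Once the pieces are in ascending order of value, producing the compressed table is mostly bookkeeping. The sketch below serializes n (value, label) pairs into the layout defined above. The names put_u32 and vl_build and the msf flag are hypothetical, and the sketch assumes the caller has already sorted the pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store x as 4 bytes at p; msf != 0 means most significant byte first. */
static void put_u32(unsigned char *p, uint32_t x, int msf)
{
    for (int k = 0; k < 4; k++)
        p[k] = (unsigned char)(x >> (8 * (msf ? 3 - k : k)));
}

/* Build a value_label_table in buf from n pairs sorted by ascending val.
   Returns the table length in bytes, or 0 if buf is too small. */
size_t vl_build(unsigned char *buf, size_t bufsize, const int32_t *val,
                const char *const *label, uint32_t n, int msf)
{
    size_t txtlen = 0, pos = 0;

    for (uint32_t i = 0; i < n; i++) {
        assert(i == 0 || val[i - 1] < val[i]);    /* val[] must ascend */
        txtlen += strlen(label[i]) + 1;
    }
    size_t total = 8 + 8 * (size_t)n + txtlen;
    if (total > bufsize) return 0;

    unsigned char *off = buf + 8;
    unsigned char *v   = off + 4 * (size_t)n;
    unsigned char *txt = v   + 4 * (size_t)n;

    put_u32(buf, n, msf);
    put_u32(buf + 4, (uint32_t)txtlen, msf);
    for (uint32_t i = 0; i < n; i++) {
        size_t len = strlen(label[i]) + 1;        /* include the \0    */
        put_u32(off + 4 * (size_t)i, (uint32_t)pos, msf);
        put_u32(v   + 4 * (size_t)i, (uint32_t)val[i], msf);
        memcpy(txt + pos, label[i], len);
        pos += len;
    }
    return total;
}
```

The sketch does not enforce Stata's limits on the number of entries or on label length; a real writer would check those as well.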

It is not required that off[] or txt[] be kept in ascending order. We previously offered the example of the table that mapped 1<->yes and 2<->no:

    byte position:  00 01 02 03  04 05 06 07  08 09 10 11  12 13 14 15
    ---------------------------------------------------------------------
         contents:  00 00 00 02  00 00 00 07  00 00 00 00  00 00 00 04
          meaning:  n = 2        txtlen = 7   off[0] = 0   off[1] = 4
    ---------------------------------------------------------------------

    byte position:  16 17 18 19  20 21 22 23  24 25 26 27 28 29 30
    ---------------------------------------------------------------------
         contents:  00 00 00 01  00 00 00 02   y  e  s 00  n  o 00
          meaning:  val[0] = 1   val[1] = 2   txt --->
    ---------------------------------------------------------------------

This table could just as well be recorded as

    byte position:  00 01 02 03  04 05 06 07  08 09 10 11  12 13 14 15
    ---------------------------------------------------------------------
         contents:  00 00 00 02  00 00 00 07  00 00 00 03  00 00 00 00
          meaning:  n = 2        txtlen = 7   off[0] = 3   off[1] = 0
    ---------------------------------------------------------------------

    byte position:  16 17 18 19  20 21 22 23  24 25 26 27 28 29 30
    ---------------------------------------------------------------------
         contents:  00 00 00 01  00 00 00 02   n  o 00  y  e  s 00
          meaning:  val[0] = 1   val[1] = 2   txt --->
    ---------------------------------------------------------------------

but it could not be recorded as

    byte position:  00 01 02 03  04 05 06 07  08 09 10 11  12 13 14 15
    ---------------------------------------------------------------------
         contents:  00 00 00 02  00 00 00 07  00 00 00 04  00 00 00 00
          meaning:  n = 2        txtlen = 7   off[0] = 4   off[1] = 0
    ---------------------------------------------------------------------

    byte position:  16 17 18 19  20 21 22 23  24 25 26 27 28 29 30
    ---------------------------------------------------------------------
         contents:  00 00 00 02  00 00 00 01   y  e  s 00  n  o 00
          meaning:  val[0] = 2   val[1] = 1   txt --->
    ---------------------------------------------------------------------

It is not the out-of-order values of off[] that cause problems; it is out-of-order values of val[]. In terms of table construction, we find it easier to keep the table sorted as it grows. This way one can use a binary search routine to find the appropriate position in val[] quickly.

The following routine will find the appropriate slot. It uses the manifests we previously defined, and thus it assumes the table is in compressed form, but that is not important. Changing the definitions of the manifests to point to separate areas would be easy enough.

    /*
        slot = vlfindval(char *baseptr, int val)

        Looks for value val in label at baseptr.
        If found:      returns slot number: 0, 1, 2, ...
        If not found:  returns k<0 such that val would go in slot -(k+1)
                           k== -1 would go in slot 0.
                           k== -2 would go in slot 1.
                           k== -3 would go in slot 2.
    */

    int vlfindval(char *baseptr, int myval)
    {
        int     n ;
        int     lb, ub, try ;
        int     *val ;
        int     curval ;

        n   = *Ptr_n(baseptr) ;
        val = Ptr_val(baseptr, n) ;

        if (n==0) return(-1) ;      /* not found, insert into 0 */

        /* in what follows,                   */
        /* we know result is between [lb,ub]  */
        /* or it is not in the table          */
        lb = 0 ;
        ub = n - 1 ;
        while (1) {
            try = (lb + ub) / 2 ;
            curval = val[try] ;
            if (myval == curval) return(try) ;
            if (myval<curval) {
                ub = try - 1 ;
                if (ub<lb) return(-(try+1)) ;
                /* because want to insert before try, ergo,
                   want to return try, and transform is -(W+1). */
            }
            else /* myval>curval */ {
                lb = try + 1 ;
                if (ub<lb) return(-(lb+1)) ;
                /* because want to insert after try, ergo,
                   want to return try+1 and transform is -(W+1) */
            }
        }
        /*NOTREACHED*/
    }
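The negative return convention makes insertion straightforward: a result k<0 says the new value belongs in slot -(k+1). The sketch below shows that step on a plain sorted int array rather than on the compressed table, so that it stands alone. find_slot follows the same contract as vlfindval, and both find_slot and insert_value are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Same contract as vlfindval, but on a plain array val[0..n-1] in
   ascending order: returns the slot if found, else -(slot+1) where
   slot is the position at which myval would be inserted. */
int find_slot(const int *val, int n, int myval)
{
    int lb = 0, ub = n - 1;

    while (lb <= ub) {
        int mid = lb + (ub - lb) / 2;
        if (val[mid] == myval) return mid;
        if (myval < val[mid]) ub = mid - 1;
        else                  lb = mid + 1;
    }
    return -(lb + 1);
}

/* Insert myval into val[0..n-1], keeping ascending order.
   cap is the array's capacity. Returns the new number of entries. */
int insert_value(int *val, int n, int cap, int myval)
{
    int k = find_slot(val, n, myval);
    int slot;

    if (k >= 0) return n;                 /* already present */
    slot = -(k + 1);
    assert(n < cap);
    memmove(val + slot + 1, val + slot, (size_t)(n - slot) * sizeof *val);
    val[slot] = myval;
    return n + 1;
}
```

In a full implementation the matching entries of off[] would be shifted in the same way and the new label text appended to txt[], since off[] and txt[] need not be kept in ascending order.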


© Copyright 1996–2018 StataCorp LLC