Re: st: Typo in the .dta description

Wed, 09 Jan 2008 14:15:16 -0600

Sergiy Radyakin <[email protected]> has questions about the coding of Stata's missing values in .dta datasets. These questions concern how Stata records missing values in binary files, not how you use them in Stata.

His first question arises because of an error in -help dta-. In documenting missing values for byte variables, the help file says

    byte
        minimum nonmissing    -127   (0x80)    <- correct
        maximum nonmissing    +100   (0x64)    <- correct
        code for .            +101   (0x66)    <- INCORRECT
        code for .a           +102   (0x67)    <- INCORRECT
        code for .b           +103   (0x68)    <- INCORRECT
        ...
        code for .z           +127   (0x7f)    <- correct

This part of the documentation should read

    byte
        minimum nonmissing    -127   (0x80)
        maximum nonmissing    +100   (0x64)
        code for .            +101   (0x65)
        code for .a           +102   (0x66)
        code for .b           +103   (0x67)
        ...
        code for .z           +127   (0x7f)

The decimal values (2nd column) were correct; the parenthetical translation into hexadecimal was incorrect.

Sergiy's second question does not involve an error in documentation. Sergiy writes,

> I am also experiencing difficulties with the coding of missing values
> for double-precision floating point variables. Can anybody in Stata,
> Corp confirm that the codes reported in the file-specifications for
> doubles are correct? Or are they swapped between the LoHi and HiLo
> formats? (I am referring to the table following the sentence "In any
> case, the relevant numbers are").

The numbers in the table are correct; I have just verified them. The numbers are not swapped between HILO and LOHI. The table to which Sergiy refers reads,

    V   value                   HILO              LOHI
    ---------------------------------------------------------------
    m   -1.fffffffffffffX+3ff   ffefffffffffffff  ffffffffffffefff
    M   +1.fffffffffffffX+3fe   7fdfffffffffffff  ffffffffffffdf7f
    .   +1.0000000000000X+3ff   7fe0000000000000  000000000000e07f
    .a  +1.0010000000000X+3ff   7fe0010000000000  000000000001e07f
    .b  +1.0020000000000X+3ff   7fe0020000000000  000000000002e07f
    .z  +1.01a0000000000X+3ff   7fe01a0000000000  00000000001ae07f

    m   -1.fffffeX+7e           feffffff          fffffffe
    M   +1.fffffeX+7e           7effffff          ffffff7e
    .   +1.000000X+7f           7f000000          0000007f
    .a  +1.001000X+7f           7f000800          0008007f
    .b  +1.002000X+7f           7f001000          0010007f
    .z  +1.01a000X+7f           7f00d000          00d0007f
    ---------------------------------------------------------------

Sergiy writes,

> It would be also helpful if the numerical values for the missings were
> also reported for floating point types, similarly to the other types
> in the previous table.

There is simply no way to report these numbers in a finite number of base-10 digits. We could report them rounded to some number of digits, but compilers do not round consistently, and using the base-10 rounded number may not result in the exact bit pattern required.

One way to store these patterns in the C programming language is as follows. Let z be a C double. Assume we want to store bit pattern 7fe01a0000000000, corresponding to a HILO .z, in z. We code

    *(((unsigned char *) &z) + 0) = 0x7f ;
    *(((unsigned char *) &z) + 1) = 0xe0 ;
    *(((unsigned char *) &z) + 2) = 0x1a ;
    *(((unsigned char *) &z) + 3) = 0x00 ;
    *(((unsigned char *) &z) + 4) = 0x00 ;
    *(((unsigned char *) &z) + 5) = 0x00 ;
    *(((unsigned char *) &z) + 6) = 0x00 ;
    *(((unsigned char *) &z) + 7) = 0x00 ;

If we wanted to store the reversed bit pattern, corresponding to a LOHI computer, we code,

    *(((unsigned char *) &z) + 7) = 0x7f ;
    *(((unsigned char *) &z) + 6) = 0xe0 ;
    *(((unsigned char *) &z) + 5) = 0x1a ;
    *(((unsigned char *) &z) + 4) = 0x00 ;
    *(((unsigned char *) &z) + 3) = 0x00 ;
    *(((unsigned char *) &z) + 2) = 0x00 ;
    *(((unsigned char *) &z) + 1) = 0x00 ;
    *(((unsigned char *) &z) + 0) = 0x00 ;

The same approach works with 4-byte floats. Let z4 be a C float.
If we want to store bit pattern 7f00d000 in z4, we code

    *(((unsigned char *) &z4) + 0) = 0x7f ;
    *(((unsigned char *) &z4) + 1) = 0x00 ;
    *(((unsigned char *) &z4) + 2) = 0xd0 ;
    *(((unsigned char *) &z4) + 3) = 0x00 ;

To store the reverse pattern,

    *(((unsigned char *) &z4) + 3) = 0x7f ;
    *(((unsigned char *) &z4) + 2) = 0x00 ;
    *(((unsigned char *) &z4) + 1) = 0xd0 ;
    *(((unsigned char *) &z4) + 0) = 0x00 ;

I know there are lots of other ways to achieve this result, including the use of unions. I mention the way above simply because it is easy to understand. The above always works.

If your C compiler provides 4-byte ints, you can record the missing values for floats easily regardless of byte order,

    *((int *) &z4) = 0x7f00d000 ;

You can do that because C always writes values left to right, most significant digits first, and, on a LOHI computer, 0x7f00d000 will be stored as the byte sequence 00 d0 00 7f.

If your C compiler provides 8-byte long ints, you can also record double floating point numbers easily:

    *((long int *) &z) = 0x7fe01a0000000000 ;

Be careful. The above two statements store results in natural byte order. In addition, for the second statement

    *((long int *) &z) = 0x7fe01a0000000000

to work, long ints must be 8 bytes long. In Visual C on a 64-bit Windows computer, long ints are still only 32 bits, and you must use the special Microsoft type INT64.

My very strong recommendation to Sergiy is that he write his C code for the natural byte order of the computer, and that he reverse the bytes right at the outset when reading data with a foreign byte order.

Finally, Sergiy asks about how weighting variables are set. Jeff Pitblado <[email protected]> will be responding to that.

-- Bill
[email protected]
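
For readers who would rather not repeat the eight assignments for every code, the byte-by-byte stores described above can be collected into a small helper. What follows is only a minimal sketch under stated assumptions: the function names host_is_lohi(), double_from_hilo(), and float_from_hilo() are hypothetical (they are not part of Stata or of any library), doubles are assumed to be 8 bytes and floats 4 bytes, and the pattern is always supplied in HILO order exactly as printed in the table. The helper then stores it in the natural byte order of the computer running the code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a LOHI (low-order byte first)
 * computer and 0 on a HILO computer, by inspecting how a 2-byte
 * integer with value 1 is laid out in memory. */
static int host_is_lohi(void)
{
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Hypothetical helper: build a double from its 8-byte pattern written
 * in HILO order (as in the table), storing the bytes in the natural
 * byte order of this computer. Assumes sizeof(double) == 8. */
static double double_from_hilo(const unsigned char hilo[8])
{
    double z;
    unsigned char *p = (unsigned char *) &z;
    const int lohi = host_is_lohi();
    int i;

    assert(sizeof(double) == 8);
    for (i = 0; i < 8; i++)
        p[lohi ? 7 - i : i] = hilo[i];   /* reverse only on LOHI */
    return z;
}

/* Same idea for 4-byte floats. Assumes sizeof(float) == 4. */
static float float_from_hilo(const unsigned char hilo[4])
{
    float z4;
    unsigned char *p = (unsigned char *) &z4;
    const int lohi = host_is_lohi();
    int i;

    assert(sizeof(float) == 4);
    for (i = 0; i < 4; i++)
        p[lohi ? 3 - i : i] = hilo[i];
    return z4;
}
```

To obtain .z as a double one would pass the eight bytes 7f e0 1a 00 00 00 00 00, and for .z as a float the four bytes 7f 00 d0 00. Writing through an unsigned char pointer, as here and in the assignments above, also avoids relying on an int or long int of a particular width.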

