Re: st: Typo in the .dta description
Sergiy Radyakin <[email protected]> has questions about the coding of
Stata's missing values in .dta datasets. These questions concern how Stata
records missing values in binary files; not how you use them in Stata.
His first question arises because of an error in -help dta-. In documenting
missing values for byte variables, the help file says
    byte
        minimum nonmissing    -127   (0x80)    <- correct
        maximum nonmissing    +100   (0x64)    <- correct
        code for .            +101   (0x66)    <- INCORRECT
        code for .a           +102   (0x67)    <- INCORRECT
        code for .b           +103   (0x68)    <- INCORRECT
        ...
        code for .z           +127   (0x7f)    <- correct
This part of the documentation should read
    byte
        minimum nonmissing    -127   (0x80)
        maximum nonmissing    +100   (0x64)
        code for .            +101   (0x65)
        code for .a           +102   (0x66)
        code for .b           +103   (0x67)
        ...
        code for .z           +127   (0x7f)
The decimal values (2nd column) were correct; the parenthetical translation
into hexadecimal was incorrect.
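
As an illustration of the corrected byte codes, here is a minimal sketch in C that classifies one stored byte. The function name dta_byte_missing is hypothetical, and the sketch assumes that the codes elided by "..." in the table run consecutively from .c to .y.

```c
/* Hypothetical helper: classify one stored .dta byte value.
 * Returns 0 for an ordinary (nonmissing) value. Returns 1 for a
 * missing-value code and writes ".", ".a", ..., ".z" into buf.
 * Assumption: the codes between .b and .z are consecutive. */
int dta_byte_missing(signed char v, char buf[3])
{
    if (v < 101)                /* values up to +100 (0x64) are nonmissing */
        return 0;

    buf[0] = '.';
    if (v == 101) {             /* 0x65 is the system missing value "." */
        buf[1] = '\0';
    } else {                    /* 0x66..0x7f are ".a" through ".z" */
        buf[1] = (char)('a' + (v - 102));
        buf[2] = '\0';
    }
    return 1;
}
```

Calling dta_byte_missing(0x66, buf) fills buf with ".a", which matches the corrected hexadecimal column above.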
Sergiy's second question does not involve an error in documentation.
Sergiy writes,
> I am also experiencing difficulties with the coding of missing values
> for double-precision floating point variables. Can anybody in Stata,
> Corp confirm that the codes reported in the file-specifications for
> doubles are correct? Or are they swapped between the LoHi and HiLo
> formats? (I am referring to the table following the sentence "In any
> case, the relevant numbers are").
The numbers in the table are correct; I have just verified them.
The numbers are not swapped between HILO and LOHI. The table to
which Sergiy refers reads,
    V     value                    HILO               LOHI
    ---------------------------------------------------------------
    m     -1.fffffffffffffX+3ff    ffefffffffffffff   ffffffffffffefff
    M     +1.fffffffffffffX+3fe    7fdfffffffffffff   ffffffffffffdf7f
    .     +1.0000000000000X+3ff    7fe0000000000000   000000000000e07f
    .a    +1.0010000000000X+3ff    7fe0010000000000   000000000001e07f
    .b    +1.0020000000000X+3ff    7fe0020000000000   000000000002e07f
    .z    +1.01a0000000000X+3ff    7fe01a0000000000   00000000001ae07f

    m     -1.fffffeX+7e            feffffff           fffffffe
    M     +1.fffffeX+7e            7effffff           ffffff7e
    .     +1.000000X+7f            7f000000           0000007f
    .a    +1.001000X+7f            7f000800           0008007f
    .b    +1.002000X+7f            7f001000           0010007f
    .z    +1.01a000X+7f            7f00d000           00d0007f
    ---------------------------------------------------------------
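
The HILO column is regular enough to be generated. The following is a minimal sketch in C, not anything taken from the Stata documentation: the function names are hypothetical, and because the table lists only ., .a, .b, and .z, the equal spacing of the codes in between is an assumption.

```c
#include <stdint.h>

/* Hypothetical helpers. k = 0 selects ".", k = 1..26 selects ".a"..".z".
 * The results are the HILO patterns read as integers (most significant
 * byte first). Assumption: .c through .y are equally spaced. */

/* 8-byte doubles: "." is 7fe0000000000000 and each step adds 1 << 40. */
uint64_t dta_double_missing_bits(unsigned k)
{
    return UINT64_C(0x7fe0000000000000) + ((uint64_t)k << 40);
}

/* 4-byte floats: "." is 7f000000 and each step adds 1 << 11 (0x800). */
uint32_t dta_float_missing_bits(unsigned k)
{
    return UINT32_C(0x7f000000) + ((uint32_t)k << 11);
}
```

With k = 26 these return 0x7fe01a0000000000 and 0x7f00d000, the HILO entries for .z in the table.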
Sergiy writes,
> It would be also helpful if the numerical values for the missings were
> also reported for floating point types, similarly to the other types
> in the previous table.
There is no practical way to report these numbers exactly in base-10
digits; the double-precision code for ., for instance, is 2^1023, which
takes 308 decimal digits to write out in full. We could report them rounded
to some number of digits, but compilers do not round consistently, and using
the rounded base-10 number may not result in the exact bit pattern required.
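
The "value" column of the table is already exact because it is written in base 16, and C99 accepts a very similar notation. As a sketch, assuming a C99 compiler and IEEE 754 floats and doubles (the names below are hypothetical), the Stata notation +1.01a0000000000X+3ff corresponds to the C constant 0x1.01ap+1023, since 0x3ff is 1023 in decimal:

```c
#include <stdint.h>
#include <string.h>

/* C99 hexadecimal floating constants; the exponent after "p" is decimal.
 * 0x3ff = 1023 and 0x7f = 127. Assumes IEEE 754 double and float. */
static const double dta_double_z = 0x1.01ap+1023;   /* +1.01a0000000000X+3ff */
static const float  dta_float_z  = 0x1.01ap+127f;   /* +1.01a000X+7f */

/* Return the bits of a double as an integer (assumes an 8-byte double). */
uint64_t double_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Return the bits of a float as an integer (assumes a 4-byte float). */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Under those assumptions, double_bits(dta_double_z) is 0x7fe01a0000000000 and float_bits(dta_float_z) is 0x7f00d000, with no decimal rounding involved.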
One way to store these patterns in the C programming language is as follows.
Let z be a C double.
Assume we want to store bit pattern 7fe01a0000000000, corresponding
to a HILO .z, in z. We code
*(((unsigned char *) &z) + 0) = 0x7f ;
*(((unsigned char *) &z) + 1) = 0xe0 ;
*(((unsigned char *) &z) + 2) = 0x1a ;
*(((unsigned char *) &z) + 3) = 0x00 ;
*(((unsigned char *) &z) + 4) = 0x00 ;
*(((unsigned char *) &z) + 5) = 0x00 ;
*(((unsigned char *) &z) + 6) = 0x00 ;
*(((unsigned char *) &z) + 7) = 0x00 ;
If we wanted to store the reversed bit pattern, corresponding
to a LOHI computer, we code
*(((unsigned char *) &z) + 7) = 0x7f ;
*(((unsigned char *) &z) + 6) = 0xe0 ;
*(((unsigned char *) &z) + 5) = 0x1a ;
*(((unsigned char *) &z) + 4) = 0x00 ;
*(((unsigned char *) &z) + 3) = 0x00 ;
*(((unsigned char *) &z) + 2) = 0x00 ;
*(((unsigned char *) &z) + 1) = 0x00 ;
*(((unsigned char *) &z) + 0) = 0x00 ;
The same approach works with 4-byte floats.
Let z4 be a C float.
If we want to store bit pattern 7f00d000 in z4, we code
*(((unsigned char *) &z4) + 0) = 0x7f ;
*(((unsigned char *) &z4) + 1) = 0x00 ;
*(((unsigned char *) &z4) + 2) = 0xd0 ;
*(((unsigned char *) &z4) + 3) = 0x00 ;
To store the reverse pattern,
*(((unsigned char *) &z4) + 3) = 0x7f ;
*(((unsigned char *) &z4) + 2) = 0x00 ;
*(((unsigned char *) &z4) + 1) = 0xd0 ;
*(((unsigned char *) &z4) + 0) = 0x00 ;
I know there are lots of other ways to achieve this result, including the use
of unions. I mention the way above simply because it is easy to understand.
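
For comparison, here is one possible union-based version, offered as a sketch rather than a recommendation. The name double_from_hilo is hypothetical. The sketch assumes 8-byte doubles and a host whose floating-point byte order matches its integer byte order, which it probes at run time.

```c
#include <assert.h>

/* Hypothetical helper: build a double from 8 bytes given in HILO order
 * (most significant byte first), on either a HILO or a LOHI host.
 * Assumption: doubles use the same byte order as integers on this host. */
double double_from_hilo(const unsigned char hilo[8])
{
    union { double d; unsigned char b[sizeof(double)]; } u;
    const union { unsigned short s; unsigned char b[sizeof(unsigned short)]; }
        probe = { 1 };
    int host_is_lohi = (probe.b[0] == 1);   /* is the low byte stored first? */
    int i;

    assert(sizeof(double) == 8);
    for (i = 0; i < 8; i++)
        u.b[host_is_lohi ? 7 - i : i] = hilo[i];   /* reverse on LOHI hosts */
    return u.d;
}
```

Passing the bytes 7f e0 1a 00 00 00 00 00 yields the double-precision .z in the host's natural byte order.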
The above always works. If your C compiler provides 4-byte ints, you can
record the missing values for floats easily, whatever the computer's byte
order:
*((int *) &z4) = 0x7f00d000 ;
You can do that because C source code always writes values most significant
digit first and, on a LOHI computer, 0x7f00d000 will be stored as the bytes
00 d0 00 7f.
If your C compiler provides 8-byte long ints, you can also record double
floating point numbers easily:
*((long int *) &z) = 0x7fe01a0000000000 ;
Be careful. The above two statements store results in natural byte order.
In addition, for the second statement *((long int *) &z) = 0x7fe01a0000000000
to work, long ints must be 8 bytes long. In Visual C on a 64-bit Windows
computer, long ints are still only 32 bits, and you must use the special
Microsoft type INT64.
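
One way to sidestep the question of how wide int and long int are is to use the fixed-width types from the C99 header <stdint.h> and copy the bits with memcpy. This is a sketch under the assumption of a C99 compiler with 4-byte floats and 8-byte doubles; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: turn a bit pattern, written most significant
 * digit first, into a value stored in the host's natural byte order. */

double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);    /* assumes sizeof(double) == 8 */
    return d;
}

float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);    /* assumes sizeof(float) == 4 */
    return f;
}
```

For example, double_from_bits(UINT64_C(0x7fe01a0000000000)) gives the double-precision .z, and float_from_bits(UINT32_C(0x7f00d000)) gives the float .z. As with the casts above, the result is in natural byte order.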
My very strong recommendation to Sergiy is that he write his C code for the
natural byte order of the computer, and that he reverse the bytes right at the
outset when reading data with a foreign byte order.
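
A minimal sketch of that recommendation follows. Both function names are hypothetical, and treating every double at or above the code for . (2^1023) as missing is an assumption drawn from the table, in which . is the next representable value after M.

```c
#include <stddef.h>

/* Hypothetical helper: reverse the n bytes at p in place. Apply it to
 * each value read from a file whose byte order differs from the host's. */
void reverse_bytes(void *p, size_t n)
{
    unsigned char *b = p;
    size_t i;

    for (i = 0; i < n / 2; i++) {
        unsigned char t = b[i];
        b[i] = b[n - 1 - i];
        b[n - 1 - i] = t;
    }
}

/* Hypothetical helper: once a value is in natural byte order, an ordinary
 * comparison is enough. Assumption: every double >= 2^1023 is missing. */
int dta_double_is_missing(double d)
{
    return d >= 0x1p+1023;
}
```

In use, one would read the 8 raw bytes into an unsigned char buffer, call reverse_bytes(buf, 8) only if the file's byte order is foreign, copy the buffer into a double, and from then on work entirely in natural byte order.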
Finally, Sergiy asks about how weighting variables are set. Jeff
Pitblado <[email protected]> will be responding to that.
-- Bill
[email protected]