Re: st: Typo in the .dta description

From	[email protected] (William Gould, StataCorp LP)
To	[email protected]
Subject	Re: st: Typo in the .dta description
Date	Wed, 09 Jan 2008 14:15:16 -0600
Sergiy Radyakin <[email protected]> has questions about the coding of
Stata's missing values in .dta datasets.  These questions concern how Stata
records missing values in binary files, not how you use them in Stata.

His first question arises because of an error in -help dta-.  In documenting
missing values for byte variables, the help file says

          byte
              minimum nonmissing        -127 (0x80)    <- correct
              maximum nonmissing        +100 (0x64)    <- correct
              code for .                +101 (0x66)    <- INCORRECT
              code for .a               +102 (0x67)    <- INCORRECT
              code for .b               +103 (0x68)    <- INCORRECT
              ...
              code for .z               +127 (0x7f)    <- correct

This part of the documentation should read
          
          byte
              minimum nonmissing        -127 (0x80)
              maximum nonmissing        +100 (0x64)
              code for .                +101 (0x65)
              code for .a               +102 (0x66)
              code for .b               +103 (0x67)
              ...
              code for .z               +127 (0x7f)

The decimal values (2nd column) were correct; the parenthetical translation 
into hexadecimal was incorrect.
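
As a cross-check on the corrected byte codes, here is a minimal C sketch.
The function names are hypothetical (they are not part of Stata or of the
.dta documentation), and the sketch simply treats every stored value at or
below +100 as nonmissing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rank of a stored byte value.
   0 = nonmissing (value <= 100), 1 = '.', 2 = .a, ..., 27 = .z */
int byte_missing_rank(signed char v)
{
    if (v <= 100)
        return 0;
    return v - 100;   /* 101 (0x65) -> 1, 102 (0x66) -> 2, ..., 127 (0x7f) -> 27 */
}

/* Hypothetical helper: write "", ".", ".a", ..., ".z" into buf.
   Assumes a character set in which 'a'..'z' are contiguous (e.g., ASCII). */
void byte_missing_label(signed char v, char buf[3])
{
    int r = byte_missing_rank(v);

    assert(buf != NULL);
    if (r == 0) {
        buf[0] = '\0';
    } else if (r == 1) {
        buf[0] = '.';
        buf[1] = '\0';
    } else {
        buf[0] = '.';
        buf[1] = (char)('a' + (r - 2));
        buf[2] = '\0';
    }
}
```

For example, byte_missing_label(0x67, buf) leaves ".b" in buf, matching
the corrected table.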

Sergiy's second question does not involve an error in documentation.
Sergiy writes, 

> I am also experiencing difficulties with the coding of missing values
> for double-precision floating point variables. Can anybody in Stata,
> Corp confirm that the codes reported in the file-specifications for
> doubles are correct? Or are they swapped between the LoHi and HiLo
> formats? (I am referring to the table following the sentence "In any
> case, the relevant numbers are").

The numbers in the table are correct; I have just verified them.
The numbers are not swapped between HILO and LOHI.  The table to 
which Sergiy refers reads, 

     V            value                HILO             LOHI
     ---------------------------------------------------------------
     m    -1.fffffffffffffX+3ff   ffefffffffffffff  ffffffffffffefff
     M    +1.fffffffffffffX+3fe   7fdfffffffffffff  ffffffffffffdf7f
     .    +1.0000000000000X+3ff   7fe0000000000000  000000000000e07f
     .a   +1.0010000000000X+3ff   7fe0010000000000  000000000001e07f
     .b   +1.0020000000000X+3ff   7fe0020000000000  000000000002e07f
     .z   +1.01a0000000000X+3ff   7fe01a0000000000  00000000001ae07f

     m    -1.fffffeX+7e           feffffff          fffffffe
     M    +1.fffffeX+7e           7effffff          ffffff7e
     .    +1.000000X+7f           7f000000          0000007f
     .a   +1.001000X+7f           7f000800          0008007f
     .b   +1.002000X+7f           7f001000          0010007f
     .z   +1.01a000X+7f           7f00d000          00d0007f
     ---------------------------------------------------------------
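
The rows shown for ., .a, .b, and .z suggest a regular pattern: each
successive missing value adds 1 to the third hex digit of the mantissa.  The
sketch below builds the patterns arithmetically under that assumption (the
table does not list .c through .y, so even spacing for those codes is
inferred, not documented).  The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* k = 0 for '.', 1 for .a, ..., 26 for .z.
   Assumption: codes are evenly spaced, as the listed rows suggest. */
uint64_t double_missing_bits(int k)
{
    assert(k >= 0 && k <= 26);
    return UINT64_C(0x7fe0000000000000) + ((uint64_t)k << 40);
}

uint32_t float_missing_bits(int k)
{
    assert(k >= 0 && k <= 26);
    return UINT32_C(0x7f000000) + ((uint32_t)k << 11);
}

/* Split a pattern into the nbytes bytes as they appear in the file:
   most significant byte first for HILO, last for LOHI. */
void bits_to_bytes(uint64_t bits, int nbytes, int lohi, unsigned char *out)
{
    for (int i = 0; i < nbytes; i++) {
        /* i-th byte counting from the most significant end */
        unsigned char b = (unsigned char)((bits >> (8 * (nbytes - 1 - i))) & 0xff);
        out[lohi ? nbytes - 1 - i : i] = b;
    }
}
```

With k = 26 the two functions give 7fe01a0000000000 and 7f00d000, the .z
rows of the table, and bits_to_bytes() with lohi set gives the LOHI column.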

Sergiy writes, 

> It would be also helpful if the numerical values for the missings were
> also reported for floating point types, similarly to the other types
> in the previous table.

There is simply no way to report these numbers in a finite number of 
base-10 digits.  We could report them rounded to some number of digits, 
but compilers do not round consistently and using the base-10 rounded 
number may not result in the exact bit pattern required.
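
The "value" column in the table is already an exact notation: a hexadecimal
mantissa times a power of two, with the exponent also written in hex.  One
option not mentioned in the original message, assuming a C99 (or later)
compiler and IEEE 754 doubles and floats, is to write the same values as C
hexadecimal floating constants.  Note that C writes the exponent in decimal,
so X+3ff becomes p+1023 and X+7f becomes p+127.  The constant and function
names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumes C99 hexadecimal floating constants and IEEE 754 formats. */
static const double STATA_DOUBLE_DOT = 0x1.0p+1023;    /* .  as a double */
static const double STATA_DOUBLE_Z   = 0x1.01ap+1023;  /* .z as a double */
static const float  STATA_FLOAT_Z    = 0x1.01ap+127f;  /* .z as a float  */

/* Return the bit pattern of a double as an integer (natural byte order). */
uint64_t double_bits(double d)
{
    uint64_t u;

    assert(sizeof u == sizeof d);
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Return the bit pattern of a float as an integer (natural byte order). */
uint32_t float_bits(float f)
{
    uint32_t u;

    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Under those assumptions, double_bits(STATA_DOUBLE_Z) is 0x7fe01a0000000000
and float_bits(STATA_FLOAT_Z) is 0x7f00d000, the values in the table.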

One way in the C programming language to store these patterns is

        Let z be a C double.
        Assume we want to store bit pattern 7fe01a0000000000, corresponding
        to a HILO .z in z.  We code 

               *(((unsigned char *) &z) + 0) = 0x7f ;
               *(((unsigned char *) &z) + 1) = 0xe0 ;
               *(((unsigned char *) &z) + 2) = 0x1a ;
               *(((unsigned char *) &z) + 3) = 0x00 ;
               *(((unsigned char *) &z) + 4) = 0x00 ;
               *(((unsigned char *) &z) + 5) = 0x00 ;
               *(((unsigned char *) &z) + 6) = 0x00 ;
               *(((unsigned char *) &z) + 7) = 0x00 ;

        If we wanted to store the reversed bit pattern, corresponding 
        to a LOHI computer, we code

               *(((unsigned char *) &z) + 7) = 0x7f ;
               *(((unsigned char *) &z) + 6) = 0xe0 ;
               *(((unsigned char *) &z) + 5) = 0x1a ;
               *(((unsigned char *) &z) + 4) = 0x00 ;
               *(((unsigned char *) &z) + 3) = 0x00 ;
               *(((unsigned char *) &z) + 2) = 0x00 ;
               *(((unsigned char *) &z) + 1) = 0x00 ;
               *(((unsigned char *) &z) + 0) = 0x00 ;

       The same approach works with 4-byte floats.

       Let z4 be a C float.
       If we want to store bit pattern 7f00d000 in z4, we code

               *(((unsigned char *) &z4) + 0) = 0x7f ;
               *(((unsigned char *) &z4) + 1) = 0x00 ;
               *(((unsigned char *) &z4) + 2) = 0xd0 ;
               *(((unsigned char *) &z4) + 3) = 0x00 ;

       To store the reverse pattern, 

               *(((unsigned char *) &z4) + 3) = 0x7f ;
               *(((unsigned char *) &z4) + 2) = 0x00 ;
               *(((unsigned char *) &z4) + 1) = 0xd0 ;
               *(((unsigned char *) &z4) + 0) = 0x00 ;
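
The byte-by-byte stores above can be folded into one small helper.  This is
only a sketch; the name store_pattern and its argument order are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy the n bytes of a pattern, given in HILO order,
   into the object at dst.  If reversed is nonzero, the bytes are stored in
   the opposite (LOHI) order. */
void store_pattern(void *dst, const unsigned char *hilo, size_t n, int reversed)
{
    unsigned char *p = (unsigned char *)dst;

    assert(dst != NULL && hilo != NULL);
    for (size_t i = 0; i < n; i++)
        p[reversed ? n - 1 - i : i] = hilo[i];
}
```

With

        static const unsigned char dot_z[8] =
                {0x7f, 0xe0, 0x1a, 0x00, 0x00, 0x00, 0x00, 0x00} ;

the call store_pattern(&z, dot_z, sizeof z, 0) performs the eight HILO
assignments, and passing 1 as the last argument performs the reversed ones.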

I know there are lots of other ways to achieve this result, including the use
of unions.  I mention the way above simply because it is easy to understand.
The above always works.  If your C compiler provides 4-byte ints, you can 
record the missing values for floats easily regardless of byte order:

               *((int *) &z4) = 0x7f00d000 ;

You can do that because C always writes values left to right, most
significant digit first, and, on a LOHI computer, 0x7f00d000 will be stored
as the byte sequence 00 d0 00 7f.

If your C compiler provides 8-byte long ints, you can also record double 
floating point numbers easily:

               *((long int *) &z) = 0x7fe01a0000000000 ;

Be careful.  The above two statements store results in natural byte order.  
In addition, for the second statement *((long int *) &z) = 0x7fe01a0000000000
to work, long ints must be 8 bytes long.  In Visual C on a 64-bit Windows 
computer, long ints are still only 32 bits, and you must use the special
Microsoft type INT64.
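
Where the C99 header <stdint.h> is available (an assumption; it is not
mentioned above), the fixed-width types uint32_t and uint64_t avoid the
question of how wide int and long int are on a given compiler, and memcpy()
avoids the pointer cast.  A minimal sketch with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a double from its 64-bit pattern, in natural byte order.
   Assumes double is an 8-byte IEEE 754 type. */
double double_from_bits(uint64_t bits)
{
    double d;

    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Build a float from its 32-bit pattern, in natural byte order.
   Assumes float is a 4-byte IEEE 754 type. */
float float_from_bits(uint32_t bits)
{
    float f;

    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, z = double_from_bits(UINT64_C(0x7fe01a0000000000)) stores .z in
a double, and z4 = float_from_bits(UINT32_C(0x7f00d000)) stores .z in a float.
As with the cast versions, the result is in the computer's natural byte order.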

My very strong recommendation to Sergiy is that he write his C code for the 
natural byte order of the computer, and that he reverse the bytes right at the
outset when reading data with a foreign byte order.
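
A sketch of that recommendation follows.  The function names are
hypothetical, and how the file's byte order is determined is left to the
caller.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 on a LOHI (least significant byte first) computer, else 0. */
int native_is_lohi(void)
{
    unsigned int one = 1;

    return *(unsigned char *)&one == 1;
}

/* Reverse the n bytes at buf in place. */
void reverse_bytes(void *buf, size_t n)
{
    unsigned char *p = (unsigned char *)buf;
    size_t i, j;

    assert(buf != NULL || n == 0);
    if (n < 2)
        return;
    for (i = 0, j = n - 1; i < j; i++, j--) {
        unsigned char t = p[i];
        p[i] = p[j];
        p[j] = t;
    }
}

/* Convert one value just read from a file to natural byte order.
   file_is_lohi is nonzero if the file was written in LOHI order. */
void to_native(void *buf, size_t n, int file_is_lohi)
{
    if ((file_is_lohi != 0) != native_is_lohi())
        reverse_bytes(buf, n);
}
```

After reading each 2-, 4-, or 8-byte value into a buffer, one call to
to_native() puts it in natural order, and the rest of the program can compare
it against the natural-order patterns without further thought about byte
order.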

Finally, Sergiy asks about how weighting variables are set.  Jeff 
Pitblado <[email protected]> will be responding to that. 

-- Bill
[email protected]