Statalist The Stata Listserver



Re: st: a question about number precision


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: a question about number precision
Date   Mon, 03 Apr 2006 12:12:51 -0500

Jian Zhang <jzh@ucdavis.edu> asked, 

> I have a problem about number precision.  [...]
>
> Here is the data:
>               ID
>         21557127 
>
> then i run the following do file trying to extract the last three digits
> from the ID:
> 
>        . gen double temxxx=(ID/1000)
>        . gen temyyy=int(temxxx)
>        . gen temzzz=temxxx-temyyy
>        . gen areaxxx=(temzzz*1000)
>        . drop temxxx temyyy temzzz
>
> the generated data looks like the following:
>
>            ID           areaxxx
>      21557127               127
>
> However, when I typed: list if areaxxx==127, stata in fact listed nothing!

Alternative solutions to the problem have already been offered, but none of
those posting has answered the question: what did Jian Zhang do wrong?

To answer that question, let's do the following:

        . set obs 1                             // setup problem
        . gen double ID = 21557127
        . list

        . gen double temxxx = (ID/1000)         // What Jian Zhang did
        . gen temyyy = int(temxxx)
        . gen temzzz = temxxx - temyyy
        . gen areaxxx = temzzz*1000

        . list                                  // Examine result
        . display %16.0g temzzz
        . display %16.0g areaxxx


Here's the result:

        . set obs 1 
        obs was 0, now 1

        . gen double ID = 21557127

        . list

             +----------+
             |       ID |
             |----------|
          1. | 21557127 |
             +----------+

        . 
        . gen double temxxx = (ID/1000)

        . gen temyyy = int(temxxx)

        . gen temzzz = temxxx - temyyy

        . gen areaxxx = temzzz*1000

        . 
        . list

             +--------------------------------------------------+
             |       ID      temxxx   temyyy   temzzz   areaxxx |
             |--------------------------------------------------|
          1. | 21557127   21557.127    21557     .127       127 |
             +--------------------------------------------------+

        . display %16.0g temzzz
         .12700000405312

        . display %16.0g areaxxx
         127.00000762939

The desired result -- 127 -- looked fine until we examined it in detail, 
and then we discovered that the result was in fact 127.00000762939!

That explains why Jian Zhang reported, "when I typed: list if areaxxx==127,
stata in fact listed nothing!"

areaxxx != 127 because temzzz != .127.  temzzz equals .12700000405312.
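
To put a number on how far off areaxxx is, the session above can be rerun
with a couple of extra checks.  This is only a sketch: it assumes Stata's
default storage type is float (that is, -set type- has not been changed),
which is what the unqualified -gen- commands above rely on.

```stata
clear
set obs 1
generate double ID = 21557127

generate double temxxx = ID/1000
generate temyyy  = int(temxxx)         // stored as float (default type)
generate temzzz  = temxxx - temyyy     // stored as float
generate areaxxx = temzzz*1000         // stored as float

count if areaxxx == 127                // reports 0: no observation matches
display %16.0g areaxxx[1] - 127        // the gap between areaxxx and 127
display %16.0g 2^(-17)                 // float spacing between 64 and 128

* areaxxx landed on the float immediately above 127
assert areaxxx == 127 + 2^(-17)
```

The gap is tiny, which is why -list- shows 127, but == compares every bit.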

I have said this before:  Programs like Stata store results in binary, and 
to the right of the decimal point, there is often not an exact equivalent
between decimal and binary given a finite number of digits.  For .5 there is
an exact equivalent:  .1 base 2.  For .25 there is an exact equivalent:  .01
base 2.  For .125 there is an exact equivalent:  .001 base 2.

Understand how to read the above.  To the right of the binary point the 
powers are 2^(-1), 2^(-2), and so on.  .1 base 2 is 1*2^(-1) = 1/2.  
.01 base 2 is 1*2^(-2) = 1/4.  .11 base 2 would be 1/2+1/4 = 3/4.  
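
The place values can be checked directly with -display-.  A small sketch,
writing each binary fraction out as a sum of powers of 2:

```stata
display 1*2^(-1)                       // .1   base 2 = .5
display 1*2^(-2)                       // .01  base 2 = .25
display 1*2^(-3)                       // .001 base 2 = .125
display 1*2^(-1) + 1*2^(-2)            // .11  base 2 = .75

* sums of powers of 2 are stored exactly, so == is safe here
assert 1*2^(-1) + 1*2^(-2) == .75
```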

There are lots and lots of numbers < 1 for which there is an exact binary
representation.  What is important to understand, however, is that just
because there is an exact representation in one base does not imply there is an
exact representation in another.  Think of the number 1/3.  In base 10, it is
.3333333... and that requires an infinite number of digits.  In base 3,
however, 1/3 is .1 base 3.

For the decimal number .127, there is no exact binary equivalent in a finite
number of digits.  The closest a double-precision binary computer can 
come is 

    A = .0010000010000011000100100110111010010111100011010101000 base 2

and that is very close, being off by less than  2^(-55), or 2.776e-17.
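
One way to inspect the stored bits is Stata's %21x display format, which
shows a double as a hexadecimal mantissa times a power of 2.  This is
offered as a checking aid, not as part of the original calculation.

```stata
display %21x .5                        // +1.0000000000000X-001
display %21x .125                      // +1.0000000000000X-003
display %21x .127                      // +1.04189374bc6a8X-003
```

Read X-003 as "times 2^(-3)".  Expanding each hexadecimal digit after the
point into four binary digits, and then shifting the binary point three
places to the left, gives the binary digits to compare against A.  The
exact fractions .5 and .125 have all-zero mantissas after the leading 1;
.127 does not.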

It is not important in and of itself that .127 has no exact binary 
representation.  Pretend you speak base 10 and that I speak base 2, and 
we put a translator between us

               You (base 10)  <-->  TRANSLATOR <--> Me (base 2)

You ask me if .127 is equal to .127.  The TRANSLATOR changes your question to
whether A (.0010000... base 2, see above) is equal to A.  I reply that it is,
and you hear YES.

For lots of problems, that is how the process works.  The question is changed 
a little, but it does not matter.  That is why you have never thought 
about this problem.

There are other questions, however, for which the translation makes a
difference in the answer, and that happens because the TRANSLATOR translates
the question to a finite (and fixed) number of digits, and it happens when 
the base 10 number has no exact binary representation.

Ask the question whether 125.127-125 is equal to .127.  The answer will be NO.

Let's pretend the translator translates to ten digits.
125.127 has no exact representation, but to ten digits, it is 

         1111101.001

.127 has no exact representation, but to ten digits, it is

         .0010000010

So the translated question becomes

     Is 1111101.001 - 1111101 equal to .0010000010?

Let's perform the subtraction:

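     Is .001 equal to .0010000010?

It is not.  The trailing digits of 125.127 were lost when it was translated
to ten digits, so the subtraction cannot recover them.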

That is what happened to Jian Zhang.
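
The same question can be put to Stata itself, where the translator works in
double precision rather than the ten digits pretended above.  A minimal
sketch:

```stata
display (125.127 - 125) == .127        // displays 0, meaning "not equal"

display %21x (125.127 - 125)           // bits of the difference
display %21x .127                      // bits of .127; the trailing digits differ

display %16.0g (125.127 - 125)         // looks like .127 until enough digits show

assert (125.127 - 125) != .127
```

125.127 spends some of its binary digits on the 125, so fewer are left for
the fractional part than .127 gets when it is stored on its own.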

Jian Zhang wrote

        . gen double temxxx = (ID/1000)
        . gen temyyy = int(temxxx)
        . gen temzzz = temxxx - temyyy
        . gen areaxxx = temzzz*1000

but here is what he should have written:

        . gen double temxxx = (ID/1000)
        . gen temyyy = int(temxxx)
        . gen temzzz = temxxx - temyyy
        . gen areaxxx = round(temzzz*1000)    <-- this line changed

In temzzz, Jian Zhang had a number < 1.  Moreover, he knew that it contained
3 decimal digits, so multiplying it by 1000 should produce an integer.
However, Jian Zhang needed to remember that temzzz contained what amounted
to 3 decimal digits stored in binary.  Think in binary.  Multiplying temzzz by
1000 resulted in an integer + fraction.  Zhang needed to convert that 
result back to the closest integer.
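
To confirm that the changed line does what is wanted, the corrected
calculation can be run from scratch and followed by the comparison that
originally failed.  As before, this sketch assumes the default storage type
is float.

```stata
clear
set obs 1
generate double ID = 21557127

generate double temxxx = ID/1000
generate temyyy  = int(temxxx)
generate temzzz  = temxxx - temyyy
generate areaxxx = round(temzzz*1000)  // round to the nearest integer

display %16.0g areaxxx[1]              // 127
count if areaxxx == 127                // reports 1
assert areaxxx == 127
```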

Here's a better way Jian Zhang could have written the calculation:

       . gen leftpart = int(ID/1000)
       . gen areaxxx = ID - leftpart*1000

Those of us used to facing this problem make sure we always cast 
our solutions in terms of integers.  Regardless of base, the definition of 
integers is the same, and thus the issue of rounding results because of 
base conversion never arises.  THAT IS AN IMPORTANT GENERAL PRINCIPLE.

Let's try my formula:  I start with 21557127.  If I divide it by 1000 and take
the integer part, I have 21557.  If I multiply that by 1000, I have 21557000.
Finally, if I subtract, I have the desired 127 result.

The nature of that calculation does not change because of base conversion.
The key part of the calculation that ensured the calculation would be 
independent of base was taking the integer part.  Never did I hold important
information, or even look, to the right of the decimal point.  Or the binary
point, or the base-16 point.  Whatever fractional part there was, I 
immediately threw it away.
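
Here is the integer calculation as a self-contained sketch.  The explicit
long and int storage types and the -mod()- cross-check are additions for
illustration, not part of the commands above; the sketch assumes the IDs are
nonnegative integers small enough to fit in a long.

```stata
clear
set obs 1
generate long ID = 21557127            // long stores integers exactly

generate long leftpart = int(ID/1000)  // 21557; fraction discarded at once
generate int  areaxxx  = ID - leftpart*1000

count if areaxxx == 127                // reports 1
assert areaxxx == 127
assert areaxxx == mod(ID, 1000)        // mod() gives the same remainder
```

Every stored value here is an integer, so nothing depends on how a decimal
fraction happens to be represented in binary.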

The rule is, "Make extraction calculations using integers."  

-- Bill
wgould@stata.com

P.S.  Is .127 in binary 
      .0010000010000011000100100110111010010111100011010101000 base 2?
      Or at least close to it?

      I think so, but I admit I did not check my work.  

<end>