
Re: st: a question about number precision

From	Michael McCulloch <[email protected]>
To	[email protected]
Subject	Re: st: a question about number precision
Date	Mon, 03 Apr 2006 10:31:01 -0700

This explanation is very clear.
May I ask, for those of us with less experience: which commands or methods should we stick to when generating variables, so that Stata doesn't miss the integer result we want because of binary storage?
Michael McCulloch

At 10:12 AM 4/3/2006, you wrote:

Jian Zhang <[email protected]> asked,

> I have a problem about number precision.  [...]
>
> Here is the data:
>               ID
>         21557127
>
> then i run the following do file trying to extract the last three digits
> from the ID:
>
>        . gen double temxxx=(ID/1000)
>        . gen temyyy=int(temxxx)
>        . gen temzzz=temxxx-temyyy
>        . gen areaxxx=(temzzz*1000)
>        . drop temxxx temyyy temzzz
>
> the generated data looks like the following:
>
>            ID           areaxxx
>      21557127               127
>
> However, when I typed: list if areaxxx==127, stata in fact listed nothing!

Alternative solutions to the problem have already been offered, but none
of those posting has answered the question: what did Jian Zhang do wrong?

To answer that question, let's do the following:

        . set obs 1                             // setup problem
        . gen double ID = 21557127
        . list

        . gen double temxxx = (ID/1000)         // What Jian Zhang did
        . gen temyyy = int(temxxx)
        . gen temzzz = temxxx - temyyy
        . gen areaxxx = temzzz*1000

        . list                                  // Examine result
        . display %16.0g temzzz
        . display %16.0g areaxxx


Here's the result:

        . set obs 1
        obs was 0, now 1

        . gen double ID = 21557127

        . list

             +----------+
             |       ID |
             |----------|
          1. | 21557127 |
             +----------+

        .
        . gen double temxxx = (ID/1000)

        . gen temyyy = int(temxxx)

        . gen temzzz = temxxx - temyyy

        . gen areaxxx = temzzz*1000

        .
        . list

             +--------------------------------------------------+
             |       ID      temxxx   temyyy   temzzz   areaxxx |
             |--------------------------------------------------|
          1. | 21557127   21557.127    21557     .127       127 |
             +--------------------------------------------------+

        . display %16.0g temzzz
         .12700000405312

        . display %16.0g areaxxx
         127.00000762939

The desired result -- 127 -- looked fine until we examined it in detail,
and then we discovered that the result was in fact 127.00000762939!

That explains why Jian Zhang reported, "when I typed: list if areaxxx==127,
stata in fact listed nothing!"

areaxxx != 127 because temzzz != .127.  temzzz equals .12700000405312.
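
The same diagnosis can be written as a checked sketch, assuming a fresh
session with the default storage type (float) in effect.  -generate- stores
a new variable as float unless told otherwise, so temzzz and areaxxx each
hold only about 7 decimal digits.  The variables ending in _d are made up
for illustration and are not part of Jian Zhang's do-file.

```stata
clear
set obs 1
gen double ID = 21557127
gen double temxxx = ID/1000
gen temyyy = int(temxxx)            // float by default; 21557 fits exactly
gen temzzz = temxxx - temyyy        // float by default
gen areaxxx = temzzz*1000           // float by default

assert temzzz == float(.127)        // temzzz is .127 rounded to float precision
assert areaxxx != 127               // so -list if areaxxx==127- lists nothing

* Hypothetical all-double variant: the error shrinks,
* but nothing guarantees that it vanishes
gen double temzzz_d  = temxxx - int(temxxx)
gen double areaxxx_d = temzzz_d*1000
display %18.0g areaxxx_d[1]
assert abs(areaxxx_d - 127) < 1e-6
```

Declaring everything double makes the discrepancy much smaller, but the
result is still the product of a binary fraction and 1000, so an exact 127
should not be counted on.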

I have said this before:  Programs like Stata store results in binary, and
to the right of the decimal point, there is often not an exact equivalent
between decimal and binary given a finite number of digits.  For .5 there is
an exact equivalent:  .1 base 2.  For .25 there is an exact equivalent:  .01
base 2.  For .125 there is an exact equivalent:  .001 base 2.

Understand how to read the above.  To the right of the binary point the
powers are 2^(-1), 2^(-2), and so on.  .1 base 2 is 1*2^(-1) = 1/2.
.01 base 2 is 1*2^(-2) = 1/4.  .11 base 2 would be 1/2+1/4 = 3/4.
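
One way to look at these exact cases from inside Stata is the %21x display
format, which shows a stored double as a hexadecimal mantissa times a power
of 2.  A small sketch using the values from the paragraphs above:

```stata
display %21x .5        // 1 * 2^(-1); every mantissa digit after the point is 0
display %21x .25       // 1 * 2^(-2)
display %21x .125      // 1 * 2^(-3)
display %21x .75       // 1.8 base 16 = 1.1 base 2, times 2^(-1): 1/2 + 1/4

* Sums of exactly representable fractions compare equal (1 means true)
display .5 + .25 == .75
display .5 + .25 + .125 == .875
```

In the output, the part after X is the power of 2, so X-001 means
"times 2^(-1)".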

There are lots and lots of numbers < 1 for which there is an exact binary
representation.  What is important to understand, however, is that just
because there is an exact representation in one base does not imply there is an
exact representation in another.  Think of the number 1/3.  In base 10, it is
.3333333... and that requires an infinite number of digits.  In base 3,
however, 1/3 is .1 base 3.

For the decimal number .127, there is no exact binary equivalent in a finite
number of digits.  The closest a double-precision binary computer can
come is

    A = .0010000010000011000100100110111010010111100011010101000 base 2

and that is very close, being off by less than  2^(-55), or 2.776e-17.
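
A representation like this can be checked in Stata itself.  The sketch below
assumes nothing beyond the built-in %21x format, which prints the stored
double as a hexadecimal mantissa and a power-of-2 exponent.

```stata
display %21x .127          // +1.04189374bc6a8X-003
display %20.18f .127       // the stored value written out in decimal
display 2^(-55)            // size of the 55th binary place, about 2.776e-17
```

To read the first line: X-003 means "times 2^(-3)", and each hexadecimal
digit stands for four binary digits, so the leading 0, 4, 1, 8 expand to
0000, 0100, 0001, 1000.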

It is not important in and of itself that .127 has no exact binary
representation.  Pretend you speak base 10 and that I speak base 2, and
we put a translator between us

               You (base 10)  <-->  TRANSLATOR <--> Me (base 2)

You ask me if .127 is equal to .127.  The TRANSLATOR changes your question to
whether A (.0010000... base 2, see above) is equal to A.  I reply that it is,
and you hear YES.

For lots of problems, that is how the process works.  The question is changed
a little, but it does not matter.  That is why you have never thought
about this problem.

There are other questions, however, for which the translation makes a
difference in the answer, and that happens because the TRANSLATOR translates
the question to a finite (and fixed) number of digits, and it happens when
the base 10 number has no exact binary representation.

Ask the question whether 125.127-125 is equal to .127.  The answer will be NO.

Let's pretend the translator translates to ten digits.
125.127 has no exact representation, but to ten digits, it is

         1111101.001

.127 has no exact representation, but to ten digits, it is

         .0010000010

So the translated question becomes

     Is 1111101.001 - 1111101 equal to .0010000010?

Let's perform the subtraction:

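     Is .001 equal to .0010000010?

It is not.  The binary digits that 125.127 lost when it was cut to ten digits
are still present in the translation of .127.

That is what happened to Jian Zhang.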
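
The same question can be put to Stata directly.  Stata calculates in double
precision, which carries 53 binary digits rather than the ten pretended
above, but the outcome is the same.  This is only a sketch; float() rounds
its argument to float precision and reldif() returns a relative difference,
both built-in functions.

```stata
display 125.127 - 125 == .127                  // 0: not equal
display %21x (125.127 - 125)                   // compare the low-order digits
display %21x .127                              //   of these two results
display float(125.127 - 125) == float(.127)    // 1: equal once both are rounded to float
display reldif(125.127 - 125, .127) < 1e-12    // 1: equal to within a small tolerance
```

The last two lines show the usual ways out when a fractional comparison
cannot be avoided: compare at a coarser precision, or compare to within a
tolerance.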

Jian Zhang wrote

        . gen double temxxx = (ID/1000)
        . gen temyyy = int(temxxx)
        . gen temzzz = temxxx - temyyy
        . gen areaxxx = temzzz*1000

but here is what he should have written:

        . gen double temxxx = (ID/1000)
        . gen temyyy = int(temxxx)
        . gen temzzz = temxxx - temyyy
        . gen areaxxx = round(temzzz*1000)      // this line changed

In temzzz, Jian Zhang had a number < 1.  Moreover, he knew that it contained
3 decimal digits, so multiplying it by 1000 should produce an integer.
However, Jian Zhang needed to remember that temzzz contained what amounted
to 3 decimal digits stored in binary.  Think in binary.  Multiplying temzzz by
1000 resulted in an integer + fraction.  Zhang needed to convert that
result back to the closest integer.

Here's a better way Jian Zhang could have written the calculation:

       . gen leftpart = int(ID/1000)
       . gen areaxxx = ID - leftpart*1000

Those of us used to facing this problem make sure we always cast
our solutions in terms of integers.  Regardless of base, the definition of
integers is the same, and thus the issue of rounding results because of
base conversion never arises.  THAT IS AN IMPORTANT GENERAL PRINCIPLE.

Let's try my formula:  I start with 21557127.  If I divide it by 1000 and take
the integer part, I have 21557.  If I multiply that by 1000, I have 21557000.
Finally, if I subtract, I have the desired 127 result.
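
Here is the same arithmetic as a checked sketch.  The long storage type is
an addition for illustration, not part of the commands above; it holds
integers exactly up to about 2.1 billion.  mod() is Stata's built-in modulus
function, shown as another route to the same remainder, and area_mod is a
made-up name.

```stata
clear
set obs 1
gen double ID = 21557127
gen long leftpart = int(ID/1000)          // 21557
gen long areaxxx  = ID - leftpart*1000    // 21557127 - 21557000
assert leftpart == 21557
assert areaxxx  == 127

gen long area_mod = mod(ID, 1000)         // alternative: remainder after dividing by 1000
assert area_mod == areaxxx
list ID leftpart areaxxx area_mod
```

Now -list if areaxxx==127- finds the observation, because every value that
is kept along the way is an integer.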

The nature of that calculation does not change because of base conversion.
The key part of the calculation that ensured the calculation would be
independent of base was taking the integer part.  Never did I hold important
information, or even look, to the right of the decimal point.  Or the binary
point, or the base-16 point.  Whatever fractional part there was, I
immediately threw it away.

The rule is, "Make extraction calculations using integers."
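
Two practical points follow from the rule, sketched below under the
assumption that the IDs are whole numbers of at most 9 digits.  First, the ID
itself must be stored in a type that holds it exactly: float holds every
integer exactly only up to 16,777,216, so long or double is needed here.
Second, any group of digits can be pulled out with int() and mod() alone.
The variable names are made up for the example.

```stata
clear
set obs 1
gen float  ID_f = 21557127        // float cannot hold this 8-digit integer exactly
gen double ID   = 21557127
assert ID_f != 21557127
assert ID   == 21557127

gen long last3  = mod(ID, 1000)              // 127
gen long mid2   = mod(int(ID/1000), 100)     // 57
gen long first3 = int(ID/100000)             // 215
assert last3 == 127 & mid2 == 57 & first3 == 215
```

In each line the fractional part is thrown away by int() or never created by
mod(), so nothing to the right of the binary point is ever relied upon.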

-- Bill
[email protected]

P.S.  Is .127 in binary
      .0010000010000011000100100110111010010111100011010101000 base 2?
      Or at least close to it?

      I think so, but I admit I did not check my work.

<end>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


Best wishes,
Michael


____________________________________

Michael McCulloch
Pine Street Clinic
124 Pine Street, San Anselmo, CA 94960-2674
tel     415.407.1357
fax     415.485.1065
email:  [email protected]
web:    www.pinest.org
        www.pinestreetfoundation.org






References:
- Re: st: a question about number precision
  - From: [email protected] (William Gould, Stata)
