Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Counter-intuitive overflow behavior

From	"William Gould, StataCorp LP" <[email protected]>
To	[email protected]
Subject	Re: st: Counter-intuitive overflow behavior
Date	Tue, 28 Feb 2012 11:14:32 -0600
Roger Newson <[email protected]> wrote, 

> I have a query about -c(mindouble)- and -c(maxdouble)- (see -help
> creturn- for more about these).  [...]  I have noticed some very
> counter-intuitive behavior when I attempt to assign variables of type
> -double- to values outside the range from -c(mindouble)- to
> -c(maxdouble)-, which (according to -help creturn-) are "the largest
> negative number that can be stored in the 8-byte double storage type"
> and "the largest positive number that can be stored in a double",
> respectively.

Roger shows some exmaples and concludes by writing 

> Is this counterintuitive behavior a bug or a feature? I have always 
> assumed that -c(mindouble)- and -c(maxdouble)- are as advertized, and 
> that we can assume that non-missing -double- values can be expected to 
> be in the closed interval between these 2 limits. This doesn't seem to 
> be the case.

The short answer is, 

    1.  Roger is correct about the counterintuitive behavior.

    2.  It is a feature. 

    3.  Although we advertize the range of doubles as being 
        [c(mindouble), c(maxdouble]), in fact the range is 
        [2*c(mindouble), c(maxdouble)].  We should correct 
        our documentation.

For those who do not want to immerse themselves in the details, 
understand that there is not bug (except for the documentation), and
the counterintuitive behavior will lead to no errors in subsequent
calculation and, in fact, might even make make some calculations come
out as nonmissing (and correct) when, without the counterintuitive
behavior, they would result in missing.  That's not the reason why we
allowed the counterintuitive behavior, however; it was allowed for
performance reasons.

That only leaves the question of why the manual is incorrect.
It was, I hate to admit, intentional in a misguided attempt to avoid
giving this long explanation of how it is that that the range of
negative numbers is greater than the range of positive numbers.  The
attempt worked for years, but ultimately failed. 

We will fix the manual.


What Roger found 
----------------

    1.  Roger discovered that he could store numbers below c(mindouble)
        and that they work just fine.

    2.  Roger then worked exceedingly hard to create numbers above
        c(maxdouble) in, in some cases, he succeeded!  These numbers
        sometimes worked, but mostly, coverted themselves back to
        missing, or showed that they contained a mssing value that is not
        documented in the manual such as .z_, when the official missing
        values are ., .a, .b, ..., .z.

Concerning (1), we will fix the manual. 

Concerning (2), a longer explanation is required.

We need to step back to the time we at StataCorp designed how missing
values would work.



IEEE 754-2008 Foating-point (double) numbers
--------------------------------------------

Floating point numbers are stored as 

            a * 2^b

Think of a as normalized to be in the range -2 < a < 2, although I 
oversimplify in unimportant ways. 

The expontent b, a mathematical integer, is allowed to be in the range 
[-1023, +1023].

The numnber line looks like this:


 -2^1024                          0                             2^1024
    |                             |                               |
     +-----------------------------------------------------------+


By the way, this is a rather odd number line in that it has a finite
number of points on it because a is recorded to a predetermined, fixed
number of digits.  Moreover, the number of points on the number line 
for each power of 2 is the same as for any other power of two. 

IEEE 754-2008 also provides some additional, off the number line
values, the most well known of which is NaN, meaning Not a Number.  It
would seem natural that missing values in Stata would be implemented
in terms of NaN but, for historical reasons, they are not.  Stata
understands NaNs, and uses NaNs internally, but for storage purposes,
NaNs are not used in the encoding of missing values.

This is because IEEE standard 754 has been under revision since 2000;
IEEE 754-2008 is the most recent revision, made in 2008.  IEEE 754 was
first adopted in 1985, the year Stata was first released.  In fact,
Stata was completed in 1984 and, even in 1985, it was not clear that
IEEE 754 would catch on.  There was a competing IBM floating-point
standard and, for example, Microsoft's then popular Basic used
the IBM standard rather than IEEE 754.

Stata uses IEEE 754 today, at least internally.  Nevertheless, Stata
has its own definition of missing values, and not just for historical
reasons.  Missing values are stored in datasets, datasets get old, and
users expect Stata to be able to pick up and use any dataset from the
past.  Thus, we at StataCorp control how Stata stores missing values.
Said differently, NaNs are fleeting things that arise during
calculations and that the software must deal with.  Missing values are
permanent things that need a stable definition and encoding.


How Stata stores missing values 
-------------------------------

All Stata users know that missing values are larger than any
nonmissing value, which is to say, x<. if x is not missing.  Stata
also provides other missing values, namely .a, .b, ..., .z, and they
are ordered .a < .b < ... < .z.

We arranged for all that to be true by taking an entire power of 2 for 
missing values, namely 2^1023.  The number line now looks like this:

     
 -2^1024                          0                             2^1024 IEEE
    |                             |                               |
     +---------------------------------------------------+-------+
    |                             |                      |       |
 -2^1024                          0                    2^1023   2^1024 Stata
                                                         |-------|
                                                          missing 
                                                          values

Although we advertise only 27 missing values, there are lots 
more, some 2^52 of them!  The standard missing values we advertise are

              .  =  1           * 2^1023
              .a =  1 +  1/4096 * 2^1023
              .b =  1 +  2/4096 * 2^1023
                 .
                 .
              .y = 1/ + 25/4096 * 2^1023
              .z = 1/ + 26/4096 * 2^1023

The missing valuies between . and .a are known collectively as ._, 
between .a and .b as .a_, ..., and above .z and .z_. 

Roger ran into that when he showed the example, 

>       . tabulate doc, missing
>
>                doc |      Freq.     Percent        Cum.
>       ------------+-----------------------------------
>                .z_ |         74      100.00      100.00
>       ------------+-----------------------------------
>              Total |         74      100.00

which, Roger wrote, suggested "that the variable -doc- contains a missing
value that I haven't seen mentioned in -help missing-".

Roger also discovered that he could not use .z_ as an input value.
That is because there is no single value associated with .z_.  .z_ is just
an output notation to refer to all the missing values above .z.


A difficult decision at StataCorp
---------------------------------

We at StataCorp, having decided how to encode missing values, now faced 
the nubmer line 


 -2^1024                          0                             2^1024 IEEE
    |                             |                               |
     +---------------------------------------------------+-------+
    |                             |                      |       |
 -2^1024                          0                    2^1023   2^1024 Stata
                                                         |-------|
                                                          missing 
                                                          values

The range of values in Stata is (-2^1024, 2^1023).  There are more 
negative than positive values.  We then considered the idea of 
cutting the number line off at the bottom, too:


 -2^1024                          0                             2^1024 IEEE
    |                             |                               |
     +---------------------------------------------------+-------+
    |        |                    |                      |       |
 -2^1024  -2^1023                 0                    2^1023   2^1024 Stata
    |  CUT   |                                           |-------|
    |        |                                           |missing
    |        |                                           |values
              +-----------------------------------------+
                  We considered making this the range


We considered cutting the number line at -2^1023 just so that the range 
of negative numbers would be the same as the range for positive numbers. 

While there is a pleasing symmetry in the range (-2^1023, 2^1023),
there was simply no reason to throw away those numbers (which means to
map them to missing).  There might someday be a calculation that
visited that part of the number line.  The CPU (actually FPU) was
perfectly capable of making accurate calculations in that range.
Cutting the line off would only result in the odd calculation
resulting in missing when it when it could perfectly well produce a
result.

So we left the range as (-2^1024, 2^1023). 

And then we documented the range as (-2^1023, 2^1023) just so I wouldn't 
have to write this long explanation. 

And I got to write it anyway.



Resolution
-----------

We will change c(mindouble) to be the true smallest double, but we will 
not do that immediately.  The change will need to be made under version 
control.  We will fix the manuals, but not immediately.  Because we have 
gone years without anyone noticing, waiting a little longer should not 
matter.  We will wait until the next release.  And then we'll have to 
turn this Statalist posting into a FAQ so that we will have something 
to refer to. 


-- Bill
[email protected]
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Follow-Ups:
- Re: st: Counter-intuitive overflow behavior
  - From: "Roger B. Newson" <[email protected]>
Prev by Date: Re: st: RE: New package -wridit- on SSC
Next by Date: RE: st: RE: New package -wridit- on SSC
Previous by thread: st: Counter-intuitive overflow behavior
Next by thread: Re: st: Counter-intuitive overflow behavior
Index(es):
- Date
- Thread