Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "William Gould, StataCorp LP" <wgould@stata.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Counter-intuitive overflow behavior |
Date | Tue, 28 Feb 2012 11:14:32 -0600 |
Roger Newson <r.newson@imperial.ac.uk> wrote, > I have a query about -c(mindouble)- and -c(maxdouble)- (see -help > creturn- for more about these). [...] I have noticed some very > counter-intuitive behavior when I attempt to assign variables of type > -double- to values outside the range from -c(mindouble)- to > -c(maxdouble)-, which (according to -help creturn-) are "the largest > negative number that can be stored in the 8-byte double storage type" > and "the largest positive number that can be stored in a double", > respectively. Roger shows some exmaples and concludes by writing > Is this counterintuitive behavior a bug or a feature? I have always > assumed that -c(mindouble)- and -c(maxdouble)- are as advertized, and > that we can assume that non-missing -double- values can be expected to > be in the closed interval between these 2 limits. This doesn't seem to > be the case. The short answer is, 1. Roger is correct about the counterintuitive behavior. 2. It is a feature. 3. Although we advertize the range of doubles as being [c(mindouble), c(maxdouble]), in fact the range is [2*c(mindouble), c(maxdouble)]. We should correct our documentation. For those who do not want to immerse themselves in the details, understand that there is not bug (except for the documentation), and the counterintuitive behavior will lead to no errors in subsequent calculation and, in fact, might even make make some calculations come out as nonmissing (and correct) when, without the counterintuitive behavior, they would result in missing. That's not the reason why we allowed the counterintuitive behavior, however; it was allowed for performance reasons. That only leaves the question of why the manual is incorrect. It was, I hate to admit, intentional in a misguided attempt to avoid giving this long explanation of how it is that that the range of negative numbers is greater than the range of positive numbers. The attempt worked for years, but ultimately failed. We will fix the manual. What Roger found ---------------- 1. Roger discovered that he could store numbers below c(mindouble) and that they work just fine. 2. Roger then worked exceedingly hard to create numbers above c(maxdouble) in, in some cases, he succeeded! These numbers sometimes worked, but mostly, coverted themselves back to missing, or showed that they contained a mssing value that is not documented in the manual such as .z_, when the official missing values are ., .a, .b, ..., .z. Concerning (1), we will fix the manual. Concerning (2), a longer explanation is required. We need to step back to the time we at StataCorp designed how missing values would work. IEEE 754-2008 Foating-point (double) numbers -------------------------------------------- Floating point numbers are stored as a * 2^b Think of a as normalized to be in the range -2 < a < 2, although I oversimplify in unimportant ways. The expontent b, a mathematical integer, is allowed to be in the range [-1023, +1023]. The numnber line looks like this: -2^1024 0 2^1024 | | | +-----------------------------------------------------------+ By the way, this is a rather odd number line in that it has a finite number of points on it because a is recorded to a predetermined, fixed number of digits. Moreover, the number of points on the number line for each power of 2 is the same as for any other power of two. IEEE 754-2008 also provides some additional, off the number line values, the most well known of which is NaN, meaning Not a Number. It would seem natural that missing values in Stata would be implemented in terms of NaN but, for historical reasons, they are not. Stata understands NaNs, and uses NaNs internally, but for storage purposes, NaNs are not used in the encoding of missing values. This is because IEEE standard 754 has been under revision since 2000; IEEE 754-2008 is the most recent revision, made in 2008. IEEE 754 was first adopted in 1985, the year Stata was first released. In fact, Stata was completed in 1984 and, even in 1985, it was not clear that IEEE 754 would catch on. There was a competing IBM floating-point standard and, for example, Microsoft's then popular Basic used the IBM standard rather than IEEE 754. Stata uses IEEE 754 today, at least internally. Nevertheless, Stata has its own definition of missing values, and not just for historical reasons. Missing values are stored in datasets, datasets get old, and users expect Stata to be able to pick up and use any dataset from the past. Thus, we at StataCorp control how Stata stores missing values. Said differently, NaNs are fleeting things that arise during calculations and that the software must deal with. Missing values are permanent things that need a stable definition and encoding. How Stata stores missing values ------------------------------- All Stata users know that missing values are larger than any nonmissing value, which is to say, x<. if x is not missing. Stata also provides other missing values, namely .a, .b, ..., .z, and they are ordered .a < .b < ... < .z. We arranged for all that to be true by taking an entire power of 2 for missing values, namely 2^1023. The number line now looks like this: -2^1024 0 2^1024 IEEE | | | +---------------------------------------------------+-------+ | | | | -2^1024 0 2^1023 2^1024 Stata |-------| missing values Although we advertise only 27 missing values, there are lots more, some 2^52 of them! The standard missing values we advertise are . = 1 * 2^1023 .a = 1 + 1/4096 * 2^1023 .b = 1 + 2/4096 * 2^1023 . . .y = 1/ + 25/4096 * 2^1023 .z = 1/ + 26/4096 * 2^1023 The missing valuies between . and .a are known collectively as ._, between .a and .b as .a_, ..., and above .z and .z_. Roger ran into that when he showed the example, > . tabulate doc, missing > > doc | Freq. Percent Cum. > ------------+----------------------------------- > .z_ | 74 100.00 100.00 > ------------+----------------------------------- > Total | 74 100.00 which, Roger wrote, suggested "that the variable -doc- contains a missing value that I haven't seen mentioned in -help missing-". Roger also discovered that he could not use .z_ as an input value. That is because there is no single value associated with .z_. .z_ is just an output notation to refer to all the missing values above .z. A difficult decision at StataCorp --------------------------------- We at StataCorp, having decided how to encode missing values, now faced the nubmer line -2^1024 0 2^1024 IEEE | | | +---------------------------------------------------+-------+ | | | | -2^1024 0 2^1023 2^1024 Stata |-------| missing values The range of values in Stata is (-2^1024, 2^1023). There are more negative than positive values. We then considered the idea of cutting the number line off at the bottom, too: -2^1024 0 2^1024 IEEE | | | +---------------------------------------------------+-------+ | | | | | -2^1024 -2^1023 0 2^1023 2^1024 Stata | CUT | |-------| | | |missing | | |values +-----------------------------------------+ We considered making this the range We considered cutting the number line at -2^1023 just so that the range of negative numbers would be the same as the range for positive numbers. While there is a pleasing symmetry in the range (-2^1023, 2^1023), there was simply no reason to throw away those numbers (which means to map them to missing). There might someday be a calculation that visited that part of the number line. The CPU (actually FPU) was perfectly capable of making accurate calculations in that range. Cutting the line off would only result in the odd calculation resulting in missing when it when it could perfectly well produce a result. So we left the range as (-2^1024, 2^1023). And then we documented the range as (-2^1023, 2^1023) just so I wouldn't have to write this long explanation. And I got to write it anyway. Resolution ----------- We will change c(mindouble) to be the true smallest double, but we will not do that immediately. The change will need to be made under version control. We will fix the manuals, but not immediately. Because we have gone years without anyone noticing, waiting a little longer should not matter. We will wait until the next release. And then we'll have to turn this Statalist posting into a FAQ so that we will have something to refer to. -- Bill wgould@stata.com * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/