
Re: st: Missing Values in Matrices


From   wgould@stata.com (William Gould)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Missing Values in Matrices
Date   Wed, 25 Sep 2002 11:00:12 -0500

Roger Newson <roger.newson@kcl.ac.uk>, to a reply I just wrote, asked

> What precisely is a NaN? And does it have any connection with missing 
> values, or with the "magic number" 1e300 mentioned in -[R] tabstat-? I 
> can't find any reference to NaNs in -[R] matrix define-.

Excuse me for using jargon.  NaN stands for "Not a Number" as defined 
by the IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE 754-1985.
That standard defines how the coprocessor on your computer works.
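
To make the idea concrete, here is a minimal sketch in Python (chosen only for
illustration; it is not Stata code).  It assumes a platform whose floats follow
IEEE 754, and shows two defining properties of a NaN: it compares unequal to
everything, itself included, and it propagates through arithmetic.

```python
import math

nan = float("nan")  # an IEEE 754 quiet NaN

# A NaN compares unequal to everything, including itself.
nan_equals_itself = (nan == nan)

# Arithmetic involving a NaN yields a NaN, so it propagates through a calculation.
propagated = nan + 1.0

# inf - inf has no meaningful value; the result is a NaN rather than a crash.
undefined = float("inf") - float("inf")

print(nan_equals_itself, math.isnan(propagated), math.isnan(undefined))
# prints: False True True
```

Because equality tests fail on a NaN, code has to detect one with a dedicated
check such as math.isnan().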

NaN is a way of encoding missing values, but it is not the one that Stata 
uses.  Were we reimplementing Stata from scratch today, we would probably 
adopt the NaN standard.  This all has to do with the way Stata works
internally, and you would not care one way or the other, but it probably would 
simplify our life a little here at StataCorp.

When Stata was first implemented, this IEEE standard was still not widely
accepted, so we developed our own.  You need to think back to the time before
coprocessors.  The C compiler we used to compile Stata used the IEEE standard
but Microsoft's BASIC, for instance, used IBM's COMP-3 standard which was in
wide use because it was used on the then-popular System/370.

With the introduction of the Intel floating-point coprocessors, the IEEE
standard did catch on, but the early implementations really did not follow it
very carefully.  Intel's chips followed the standard, but most people did not
have coprocessors and instead software was used to emulate the behavior of the
chip.  Emulate is a poor choice of words here.  Aped was more like it.

All of that is cleaned up now, but even as recently as three or four years
ago I remember struggling to deal with different interpretations of the
"standard", which is clear enough on what a NaN is but not how it is to be 
used.  The problem arose in Stata's behavior across platforms.  Remember, we
support Windows, Macintosh, and Unix (lots of them), and there is not
agreement among them on what should happen, for instance, when you divide by
zero.  There is an old tradition called the "exception", which means the
computer crashes.  Exceptions, however, can be intercepted and you can avoid
the crash.  Then there is the modern idea of a NaN:  divide by zero and you
get a NaN, not a crash.  In the early days, even after adoption of the
IEEE standard, lots of computers continued to yield exceptions on errors
rather than NaNs. 

I use divide by zero just as an example, but I think the problem was 
actually exp(x) for x too small, which can lead to an "Underflow exception"
(old tradition) or NaN (modern tradition).  It does not matter; the point is
that we had some computers yielding exceptions and others yielding NaNs, this
time because the more "modern" computers were switching over to the NaN idea.

So now there is lots of code inside Stata protecting it from both traditions.
Whether an exception arises or a NaN, as quickly as Stata can, it maps the
result to its own concept of a missing value and handles it from there.
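
The general shape of such protection can be sketched as follows.  This is an
illustration in Python under stated assumptions, not Stata's actual internal
code: MISSING is a hypothetical stand-in for a program's own missing-value
code, and guarded() is a hypothetical helper name.  Python happens to show
both traditions at once: dividing a float by zero raises an exception, while
inf - inf quietly returns a NaN.

```python
import math

MISSING = None  # hypothetical stand-in for a program's own missing-value code


def guarded(op, *args):
    """Run op(*args); map an arithmetic exception or a NaN result to MISSING."""
    try:
        result = op(*args)
    except ArithmeticError:
        # Old tradition: the operation raised (ZeroDivisionError, OverflowError, ...).
        return MISSING
    if isinstance(result, float) and math.isnan(result):
        # Modern tradition: the operation quietly returned a NaN.
        return MISSING
    return result


def divide(a, b):
    return a / b


def subtract(a, b):
    return a - b


inf = float("inf")
print(guarded(divide, 1.0, 0.0))    # exception path -> None
print(guarded(subtract, inf, inf))  # NaN path -> None
print(guarded(divide, 1.0, 4.0))    # ordinary result -> 0.25
```

Either way the caller sees a single missing-value code and never needs to know
which tradition the underlying platform followed.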

How consistent are things nowadays?  I really do not know.  The protection
code inside Stata prevents me from knowing because it so effectively covers up
the problem.

Now you know more than you ever wanted.

-- Bill
wgould@stata.com


