There has recently been a lengthy thread on Statalist about missing values. I
am the one responsible for missing being encoded as larger than any other
numeric value, and I too have typed
    . gen rich = (income>100000)
and been bitten. So has every other developer at StataCorp.
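To make the bite concrete: because missing sorts above every number, the
comparison is true for observations whose income is missing, so they end up
coded rich = 1. Below is a minimal sketch in Python, not Stata, that only
mimics that ordering. MISSING (here float("inf")) and is_missing() are
hypothetical stand-ins for Stata's . and missing(). The guarded version
corresponds to the usual idiom -gen rich = (income>100000) if !missing(income)-.

```python
# Hypothetical stand-in for Stata's system missing value:
# it compares greater than every ordinary number.
MISSING = float("inf")

def is_missing(x):
    """Stand-in for Stata's missing() function."""
    return x == MISSING

incomes = [50_000, 250_000, MISSING]

# Analogue of: gen rich = (income>100000)
# The missing observation is silently coded as rich.
rich_naive = [int(x > 100_000) for x in incomes]

# Analogue of: gen rich = (income>100000) if !missing(income)
# The missing observation stays missing (None here).
rich_guarded = [None if is_missing(x) else int(x > 100_000) for x in incomes]

print(rich_naive)    # [0, 1, 1]
print(rich_guarded)  # [0, 1, None]
```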
It seems obvious that we should fix the problem, so why don't we?
It is not for reasons of backward compatibility. We know how to solve
that, as everyone on the list well knows.
Some years ago, the last time the issue of missing values and the number line
arose on Statalist, we gave serious consideration to doing just that. Jeroen
Weesie was visiting StataCorp at the time and made a strong argument for doing
so. We rediscovered what we already knew, namely that the problems of missing
values and the number line are inherent and, if they do not pop up in one
place, they pop up in another.
Joseph Coveney mentioned that problem, without going into details, when he
wrote,
> SQL does this in its three-valued logic implementation. [...] It should be
> noted, however, this implementation results in its own gotcha for
> programmers. [...] (SQL has to implement a special syntax to test for
> NULLness in the data. [...])
Exactly. Like Joseph, I'm not going to delve into the details and, honestly,
I do not even recall all of them right now. We set about designing a
solution. The problem was, missing values held surprises and if they did not
pop up in one place, they would pop up in another. After finishing the
design, we came to the conclusion that users would not like it. Actually,
that's not a fair way of stating it. Some users would have liked it, and
other users would not, and it was a judgment call as to which was the larger
group. The group that would have hated it would be those involved in data
management. That, plus compatibility, led us to abandon the effort.
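The SQL gotcha Joseph alludes to can be seen with a few lines of Python and
its built-in sqlite3 module. This is only an illustration of SQL's behavior,
not of the design we considered. The table name people, the column income,
and the helper count() are made up for the example.

```python
import sqlite3

# In-memory database with one missing (NULL) income.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (income REAL)")
con.executemany("INSERT INTO people VALUES (?)",
                [(50000,), (None,), (250000,)])

def count(where):
    """Number of rows for which the WHERE clause evaluates to true."""
    sql = f"SELECT COUNT(*) FROM people WHERE {where}"
    return con.execute(sql).fetchone()[0]

# Comparing with NULL yields unknown, never true, so this finds nothing.
n_eq_null = count("income = NULL")

# The special syntax is the only way to find the missing rows.
n_is_null = count("income IS NULL")

print(n_eq_null)  # 0
print(n_is_null)  # 1
```

The missing values are still there to surprise you; they have just moved
from the comparison operator to the test for missingness.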
In this effort, Jeroen Weesie, an outsider, was a true believer. Those
who argued for a change were well represented. The fact is, after days of
work, Jeroen changed his mind, too. A recent joke on the list about Bill
Rising, formerly of Louisville and a strong proponent of change, suddenly
changing his mind after walking through the door at StataCorp has some
truth behind it.
Jeroen almost signed on to the change. What pushed him over the edge was that
he also wanted multiple missing values (.a, .b, ...), and when he saw what
they did to the design, he realized it was one or the other, and he wanted
multiple missing values more.
The change would have made data management less elegant, but it would have
brought statements like
    . gen rich = (income>100000)
more in line with expectations. These days, with Multiple Missing Value
Imputation becoming more and more popular, I am more convinced
than ever that we made the right decision. The observations containing
missing values need to be easy to identify and classify, and they are as
things are right now.
I also want to disagree with Jeph Herrin when he wrote,
> Logically, a missing value should return FALSE when compared with any
> number. Period. (ahem). That StataCorp has chosen to jettison this logic
> [...]
My disagreement is not with Jeph's point but with his overstating it. I will
illustrate by overstating a counterargument.
Logically, the law of the excluded middle -- that the formula P or
not-P is true and is a defining property of classical systems -- has
been with us since Aristotle, and we take it for granted, programmers and
computer users especially. That Jeph is so willing to jettison this
logic [...]
In my earlier days, I was very intolerant of arguments for a three-valued
logic, and I was convinced logic was on my side. More recently I have become
more understanding of the argument on the other side. I simply want to warn
Jeph and others that there is no intellectually costless solution to this
problem, and logic is not on their side or on mine.
Philosophically, I am unwilling to give up on P or not-P being true because it
is so deeply ingrained in programmers and how we write programs.
Practically, making statements like -gen rich = (income>100000)- produce what
we have in mind is impossible when some of us would say the result should be
FALSE and others would say MISSING. It is possible to create a "consistent"
system, given a loose
definition of consistent, but it adds complication that merely pushes the
problem from one expressed intention to another, and trades off one set of
complaints for another.
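Both alternatives can be tried out in a few lines of Python; this is a sketch
of the two rules, not of any design considered at StataCorp. IEEE NaN happens
to follow the "every comparison is FALSE" rule, so it stands in for that
proposal. The three-valued rule is modeled with None as "unknown", and the
helpers k_not(), k_or(), and gt3() are hypothetical names for the example.

```python
# Rule 1: every comparison with missing is False (IEEE NaN behaves this way).
nan = float("nan")
nan_gt = nan > 100_000             # False
nan_le = nan <= 100_000            # False
not_nan_gt = not (nan > 100_000)   # True: not(a > b) no longer equals (a <= b)

# Rule 2: comparisons with missing are unknown (None), three-valued logic.
def k_not(p):
    """Three-valued NOT: unknown stays unknown."""
    return None if p is None else (not p)

def k_or(p, q):
    """Three-valued OR: true if either side is true, else unknown if any is."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

def gt3(x, y):
    """x > y, or unknown when x is missing (None)."""
    return None if x is None else x > y

p = gt3(None, 100_000)
p_or_not_p = k_or(p, k_not(p))     # unknown: P or not-P is no longer true

print(nan_gt, nan_le, not_nan_gt)  # False False True
print(p_or_not_p)                  # None
```

Under the first rule a missing income is neither above nor at-or-below
100000, yet "not rich" is true for it. Under the second, P or not-P stops
being true. Either way something we take for granted gives way.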
Anyway, that's how we made the decision years ago to leave things as they
are. Nothing has changed.
-- Bill
[email protected]