Tom Steichen replied to me. My original is prefixed >, his replies are
plain, and my latest comments are blocked off with NJC and horizontal
lines.
>The first step in Stata's position is that missing numeric
>values must themselves be assigned a numeric value. This
>follows from the fact that when -sort-ed on a numeric
>value, observations with missings must go somewhere.
>The second step is then over what is to be done with missings given
>an inequality > or <.
>But, seriously, what are the alternatives?
I will take, at face value, Nick's question about alternatives
and see if I can provide answers.
To do so, I'll start with Nick's argument above that "missing numeric
values must themselves be assigned a numeric value. This follows from
the fact that when -sort-ed on a numeric value, observations with
missings must go somewhere."
It seems evident that missing values could be ignored in a sort
then _arbitrarily_ placed either first or last, without consideration
of their "assigned" numeric value. Therefore the need to consider
any "assigned" value for missing as a legitimate, sortable "numeric"
value is unnecessary. While I agree that missing needs to be assigned
a value for storage purposes, I can't agree that that value should
fit into the ordered, numeric system. Clearly, if I knew it was a
big (or small) value, then it wouldn't truly be missing.
NJC
===================================================================
I don't see what point Tom is making here. This all sounds like a
distinction without a difference. Numeric missing
_must_ have a numeric value for representation within the machine.
Whether that is explicit in documentation or known to users is
another matter. In addition, the extended missings .a to .z
must also have numeric representations for -sort-ing to be possible
at all. Clearly, there is no homunculus in the machine saying, "Oh, this
is missing, so I'll just put it at the end". Conversely, I know there is
a numeric value yet despite having re-read the documentation many times
I can never remember what it is. That is typical of the user situation:
All you need to know is that the value is higher than anything else.
If that is (part of) Tom's point, then I agree.
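
A minimal sketch of that behaviour, using a made-up toy variable x
(the data are purely illustrative): plain missing sorts after every
nonmissing number, and the extended missings sort after plain missing.

```stata
clear
input x
50
.a
1
.
end

* Missings go to the end: the sorted order is 1, 50, ., .a
sort x
list, noobs

* Both comparisons evaluate to 1 (true)
display . > 42
display .a > .
```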
===================================================================
>Allan and Tom
>seem to want "if x > 42" to ignore missings on -x-. If that
>were so, then it would solve one problem only to replace
>it with at least four others, on quite different levels:
>1. Stata is now inconsistent. Missings are assigned precise
>numeric values for some purposes (e.g. -sort-ing) but not others.
Given my argument above, a precise numeric value is not needed
for sorting, so there is no inconsistency.
>2. What, under that proposal, would be the truth value of
>an expression, say . > 42
Let me rewrite this as "local y = ." then ask what truth value I
would assign to `y' > 42. Answer: undefined (i.e., missing).
Same as local x = `y'*42. Again since I don't know the value
of y, why should I know the truth value of a statement about y
or any other operation on it?
Interpreting the truth statement this way would be highly
consistent with the way Stata handles missing elsewhere.
NJC
===================================================================
OK, so Tom seems to be asking for a three-way logic in which
logical conditions can be true, false, or missing. As others have
pointed out, that is possible and other software does something like
it. Briefly, it doesn't appeal to me.
I am reminded vividly of a talk that David Kantor gave at the first
Boston meeting in 2001. Tom was there too and David is active on this
list. Anyway, David talked about implementation of a three-way logic
in Stata through -egen- functions. The discussion was fast and furious
and seemed to fall three ways (yes indeed): those who liked David's
scheme, those who thought a three-way logic was needed, but disagreed
with David on what it should be, and those who thought that
Stata's existing two-way logic, despite problems, is nevertheless
greatly preferable overall.
To pick up a point made later in this thread, three-way logic means that
you need truth tables for logical operations involving &, | and !, and
users, including introductory users, must be exposed to all that. Also,
many, many comparisons involving (say) -if- or -if/else- need to
be revisited just in case the pertinent operand could be missing.
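
For concreteness, here is a rough sketch of how a three-valued
indicator can already be built by hand in Stata as it stands. The
variable names (x, gt42_two, gt42_three) are made up for illustration,
and this is not David's -egen- scheme, just the usual -if- qualifier.

```stata
clear
input x
10
50
.
end

* Two-valued result: missing counts as larger than 42, so this is 0, 1, 1
generate byte gt42_two = x > 42

* Three-valued result: 0, 1, and missing where x is missing
generate byte gt42_three = x > 42 if !missing(x)

list, noobs
```

Note that feeding gt42_three back into &, | or ! would still treat
missing as true, because any nonzero value counts as true in Stata.
That is exactly where new truth tables would be needed.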
===================================================================
>3. Designing a language according to what users are supposed to mean,
>rather than what they say, is, in my experience, a very long, very
>slippery slope to perdition.
Designing a language that ignores users' normal logical
interpretation is also, quote: "a very long, very slippery slope
to perdition." I doubt that most people would expect a logical
operation on an unknown value to have a known truth value (other
than an operation which asks whether the value is known). It could
well be that StataCorp's choice to subject users to this particular
trap is a reason the package is not chosen for use by more users.
I know I deal with missing values in most of my datasets;
unfortunately I cannot say with certainty that I have not blown
some analyses because of this implementation choice. I hope not
but I'm also glad my management has never asked me to justify
using a package that has this known trap.
NJC
=======================================================================
We're brandishing impressions and prejudices at each other. But
I can't see that the situation is anywhere near as bad as Tom
seems to fear. If missings are present, then almost always they
would not actually be included in modelling, summary statistics
or graphs even if you accidentally request that they be included
by virtue of a condition such as
... if x > 42
What's most evident is a data management request in which you
get missings shown when you didn't want them, but there's no
tragedy there, just an irritation.
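
A small sketch of the difference, again with a made-up toy variable x.
The explicit exclusions shown are the usual idioms, not a proposal for
changed behaviour.

```stata
clear
input x
10
50
.
end

* Counts 2: the missing satisfies x > 42
count if x > 42

* Counts 1: the missing is excluded explicitly
count if x > 42 & !missing(x)

* Counts 1: an equivalent exclusion, as every missing is >= .
count if x > 42 & x < .

* Uses 1 observation: -summarize- drops the missing anyway
summarize x if x > 42
```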
======================================================================
>4. Declaring this behaviour now to be a bug, or at least a misfeature,
>would be a major change in Stata. Goodness knows how many scripts,
>programs and understandings would be broken by such a change, even
>under version control.
I suspect StataCorp's programmers could minimize the impact.
However, even if they could not implement the change without
breaking some of my code, I'd rather pay the price of rewriting
code than continue risking a bad analysis. Sometimes the price
of a greater good is some immediate pain. While instances
of Stata changes breaking (properly implemented) user code
are rare, they have occurred and we survived.
I fully understand StataCorp's right to implement Stata in
the way they believe best, but I also see value in changing the
current implementation of comparisons to missing. I believe there
would be value to me as a user (avoidance of bad analyses) and to
StataCorp (removing an implementation choice that some (many?)
users believe is wrong and that, perhaps, could be a reason for
not choosing Stata as their statistical package).
NJC
=======================================================================
Similarly, we're sharing gut impressions. Tom and I are both
user-programmers with several years' experience of Stata, but absolutely
no knowledge of the internals of Stata. Only StataCorp can speak on
this. My guess remains that what Tom is asking is a very big deal for a
highly dubious net payoff and I can't see StataCorp budging even a smidgen
on this point.
=======================================================================