
RE: st: gsort issue

From   "Nick Cox" <>
To   <>
Subject   RE: st: gsort issue
Date   Thu, 5 Jul 2007 23:45:05 +0100

There are intersecting issues here on several different levels.

Let's start with the obvious. 

0. You want Stata to be smart enough to ignore missings as 
irrelevant when they are so, and you don't usually notice, 
and don't usually complain, when that works as designed. 

1. Missing values have to go somewhere when you -sort- the
data. There is no case for "in the middle"; it must be 
one or other end, above the very highest or below the 
very lowest. When you -sort- the data, observations with 
missing values can't just hover in some philosophical mystery 
zone; they must _go_ somewhere. 

2. You have to decide what to do with missings when 
you use inequalities. This is really the same issue as #1. 

Sometime in the year 0, meaning 1985, StataCorp, or more
precisely CRC, plumped for their choice: 

3. Stata chooses the high end. Numeric missings are arbitrarily 
large. Rumour, or history, or Bill Gould, says that, mostly, 
he had been irritated too many times by ploughing through 
ordered lists from Some Alternative Software that started 
with values he didn't care about. The other way, you can 
stop reading when it stops being interesting. 
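
As a minimal sketch of #1 to #3, assuming a made-up variable x with one missing value (the data are purely illustrative), -sort- puts the missing last and an unguarded inequality picks it up:

```stata
clear
input x
1
3
.
2
5
end

* missing counts as larger than any nonmissing value, so it sorts last
sort x
list x

* the unguarded inequality catches 3, 5 and the missing value
count if x > 2
assert r(N) == 3

* the guarded inequality catches 3 and 5 only
count if x > 2 & x < .
assert r(N) == 2
```

The guard "x < ." rather than "x != ." also excludes the extended missing values .a to .z, which sort above "." itself.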

Really, not much has changed since 1985 and those of us
still here and still using Stata in say 2029 will not, 
I guess, be discussing anything different. 

For a start, #3 has been the rule for so long and is embedded
in so many habits and so much code that changing any of 
it is a recipe for mayhem and madness. 

I don't think Tom's scheme has a chance of lift-off. 
I gather he wants -if- to change behaviour, or x > 2 to change
meaning, and either makes me feel really queasy. There is perhaps 
a little more chance of new functions, say 

gt(x, 2) meaning x > 2 & x < . 

ge(x, 2) meaning x >= 2 & x < . 

but I am not sure they would actually be used much even if they
were introduced. 

If this thread continues long enough, someone will
suggest some kind of three-way logic in which missings 
are not high or low but just different. Tom in a way has
perhaps done that already. David Kantor
gave a talk on three-way logic at one Boston meeting 
and the discussion was fast and furious. As I recall, 
the audience who spoke divided into three (surprise): 
those who were clear that three-way logic was a bad idea; 
those who wanted some kind of three-way logic, but 
definitely not David's; and those who liked David's 
scheme. David himself didn't seem to like his own scheme 
much the more he thought about it. And it didn't improve from there. 

I gave the following talk, on directional data, and was able
to explain my subject as one of circular arguments. 

I gather that StataCorp have batted this back and
forth internally, but got no further despite many 
discussions than the idea that three-way logic would
solve a few problems but make things much worse for 
most users, especially those in the first decade of 
their Stata experience. 

As a more detailed footnote, -inrange()- has been 
around for a while and already offers one kind of solution. 

-inrange(x, 42, .)- means "x >= 42 & x < .". 

In fact, this is, in essence, a generalisation of -ge()-
above. My impression is that it hasn't caught on much, 
which rather weakens any case for new functions. 
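
A small check of that equivalence, again on made-up values of x (a sketch, not a recommendation of one style over the other):

```stata
clear
input x
10
42
50
.
end

* inrange() with a missing upper limit should agree with the explicit
* two-part condition in every observation, including the missing one
assert inrange(x, 42, .) == (x >= 42 & x < .)

* so a -replace- guarded this way leaves the missing value alone
replace x = 99 if inrange(x, 42, .)
list x
```

Here only the observations holding 42 and 50 should be changed; 10 and the missing value stay as they were.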


Steichen, Thomas J.
> Isn't the simplest solution that missing should never be treated in
> Stata code as a "number"? 
> Thus, things like sorts would need a documented definition of where
> missing go but we wouldn't have to work around the "numeric" missing
> so often. 
> For example, 
>   replace x = 3 if x > 2 & x != . 
> becomes the much simpler
>   replace x = 3 if x > 2
> I wonder how often I've messed up analyses because I forgot to tag
> on the "& != ." ? 
> (Hmmmmm, the "& != ." kind of looks like cartoon-speak for what I 
> usually say when I notice I've failed to add the tag!)
> Clearly, I'd much rather have Stata's code deal with this than for me 
> to remember all the time, even if there is a processing overhead.
