RE: st: gsort issue

From   Fred Wolfe <>
To, <>
Subject   RE: st: gsort issue
Date   Thu, 05 Jul 2007 18:48:06 -0500

I am not unhappy with Stata considering missing to be the highest value. I was interested only in a specially sorted data set: missings, then ascending. I was interested in the greatest non-missing value of x. So a sort of x with missing values up front would leave the last value (_N) as the greatest value. While there are many ways to obtain this value, I was running an iterated series of -expand- and it was useful to me to have missings out of the way and knowing which observation # was _N before I expanded.
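
One of the "many ways" can be sketched with plain -sort- and a helper variable, which gives missings first and then ascending values while still recording the sort order. This is only a minimal illustration: it uses the shipped auto.dta and its variable rep78 as stand-ins for the real data, and the helper variable name notmiss is made up for the example.

```stata
* Sketch only: auto.dta and rep78 stand in for the real data set and x
sysuse auto, clear

* notmiss is a hypothetical helper: 0 if rep78 is missing, 1 otherwise
generate byte notmiss = !missing(rep78)

* Missings (notmiss == 0) come first, then rep78 in ascending order
sort notmiss rep78

* Plain -sort- records the sort order, so this macro is not empty
local sortorder : sortedby
display "sort order is `sortorder'"

* With at least one nonmissing value, the last observation holds the maximum
display "greatest nonmissing value: " rep78[_N]
```

If only the value itself is needed, -summarize rep78, meanonly- followed by r(max) avoids sorting altogether.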

I remember (from the old days) that -gsort- didn't set the sort order. But some time ago it was changed, or was it? The help for -gsort- says that it "... differs from sort in that sort produces ascending-order arrangements only."

However, the following code produces unexpected results:

. sort age duration
. local sortorder : sort
. di "sort order is `sortorder'"
sort order is age duration

. gsort -age duration
. local sortorder : sort
. di "sort order is `sortorder'"
sort order is

. gsort age duration
. local sortorder : sort
. di "sort order is `sortorder'"
sort order is age duration

. gsort -age duration,mfirst
. local sortorder : sort
. di "sort order is `sortorder'"
sort order is

So it seems that gsort sets the sort order only in ascending sorts. Either that or the reported output is wrong.

Perhaps an emendation to the help file could limn the issue.

At 05:45 PM 7/5/2007, Nick Cox wrote:

There are intersecting issues here on several different levels.

Let's start with the obvious.

0. You want Stata to be smart enough to ignore missings as
irrelevant when they are so, and you don't usually notice,
and don't usually complain, when that works as designed.

1. Missing values have to go somewhere when you -sort- the
data. There is no case for "in the middle"; it must be
one or other end, above the very highest or below the
very lowest. When you -sort- the data, observations with
missing values can't just hover in some philosophical mystery
zone; they must _go_ somewhere.

2. You have to decide what to do with missings when
you use inequalities. This is really the same issue as #1.

Sometime in the year 0, meaning 1985, StataCorp, or more
precisely CRC, plumped for their choice:

3. Stata chooses the high end. Numeric missings are arbitrarily
large. Rumour, or history, or Bill Gould, says that, mostly,
he had been irritated too many times by ploughing through
ordered lists from Some Alternative Software that started
with values he didn't care about. The other way, you can
stop reading when it stops being interesting.
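
A small sketch of what #1 to #3 mean in practice, using the shipped auto.dta (rep78 has some missing values) purely as an illustrative stand-in:

```stata
* Sketch only: auto.dta is used as a convenient example data set
sysuse auto, clear

* Numeric missing counts as larger than any number,
* so this count includes observations with missing rep78
count if rep78 > 4

* Excluding missings explicitly gives the count usually wanted
count if rep78 > 4 & rep78 < .

* After an ascending sort, the missings sit at the high end
sort rep78
list make rep78 in -5/l
```

The two counts differ by the number of missing values of rep78, which is the trap discussed in this thread.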

Really, not much has changed since 1985 and those of us
still here and still using Stata in say 2029 will not,
I guess, be discussing anything different.

For a start, #3 has been the rule for so long and is embedded
in so many habits and so much code that changing any of
it is a recipe for mayhem and madness.

I don't think Tom's scheme has a chance of lift-off.
I gather he wants -if- to change behaviour, or x > 2 to change
meaning, and either makes me feel really queasy. There is perhaps
a little more chance of new functions, say

gt(x, 2) meaning x > 2 & x < .

ge(x, 2) meaning x >= 2 & x < .

but I am not sure they would actually be used much even if they
were introduced.

If this thread continues long enough, someone will
suggest some kind of three-way logic in which missings
are not high or low but just different. Tom in a way has
perhaps done that already. David Kantor
gave a talk on three-way logic at one Boston meeting
and the discussion was fast and furious. As I recall,
the audience who spoke divided into three (surprise):
those who were clear that three-way logic was a bad idea;
those who wanted some kind of three-way logic, but
definitely not David's; and those who liked David's
scheme. David himself didn't seem to like his own scheme
much the more he thought about it. And it didn't improve from there.

I gave the following talk, on directional data, and was able
to explain my subject as one of circular arguments.

I gather that StataCorp have batted this back and
forth internally, but got no further despite many
discussions than the idea that three-way logic would
solve a few problems but make things much worse for
most users, especially those in the first decade of
their Stata experience.

As a more detailed footnote, -inrange()- has been
around for a while and already offers one kind of solution.

-inrange(x, 42, .)- means "x >= 42 & x < .".

In fact, this is, in essence, a generalisation of -ge()-
above. My impression is that it hasn't caught on much,
which rather weakens any case for new functions.
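
A minimal check of that equivalence, again with auto.dta and rep78 as assumed example data and the cutoff 4 chosen arbitrarily:

```stata
* Sketch only: compare inrange() with the explicit inequality
sysuse auto, clear

* inrange() with a missing upper limit means "at least 4 and not missing"
count if inrange(rep78, 4, .)
local n_inrange = r(N)

* The same condition written out in full
count if rep78 >= 4 & rep78 < .

* -assert- stops with an error if the two counts disagree
assert `n_inrange' == r(N)
```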


Steichen, Thomas J.

> Isn't the simplest solution that missing should never be treated in
> Stata code as a "number"?
> Thus, things like sorts would need a documented definition of where
> missing go but we wouldn't have to work around the "numeric" missing
> so often.
> For example,
>   replace x = 3 if x > 2 & x != .
> becomes the much simpler
>   replace x = 3 if x > 2
> I wonder how often I've messed up analyses because I forgot to tag
> on the "& != ." ?
> (Hmmmmm, the "& != ." kind of looks like cartoon-speak for what I
> usually say when I notice I've failed to add the tag!)
> Clearly, I'd much rather have Stata's code deal with this than for me
> to remember all the time, even if there is a processing overhead.

*   For searches and help try:

Fred Wolfe
National Data Bank for Rheumatic Diseases
Wichita, Kansas
Tel +1 316 263 2125
