
RE: st: latest update of cut (Policy on handling missing values)


From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: latest update of cut (Policy on handling missing values)
Date: Thu, 8 Aug 2002 16:54:23 +0100

Jens M. Lauritsen
>
> Changing behaviour in relation to well known and widely
> used functions is a
> "no-no" in my opinion.
>
> So I am happy to see:
>
> "I've been convinced that this is the better interpretation
> of missing for
> -egen, cut-.  Missing values will be mapped to missing; non
> missing values
> larger than that of the largest value in -at- will map to
> the largest value
> in the -at- list"
>
> The problem of cut is an aspect of this. I used cut in the original
> (MHills) version and
> then decided that I would like to have labels of both ends of the
> intervals, e.g. (1-20)(21-40) not the original cut of
> (1-)(21-) . So I made
> my "own" version cutjl based on MHills cut.
>
> My suggestion when teaching Stata (and other programs) is
> always: Find some
> core functions you know well and stick to them. Disregard
> automatic updates
> for crucial parts until you are dissatisfied with the old
> one. Better
> having an old and somewhat simpler version than a new one
> with uncertainty.
>
> Apparently cut was taken into stata (I did not know) as an
> internal command
> which is reasonable and now with changed behaviour in
> relation to missing
> (and back to original behaviour soon).
>
> I would argue that if at all anything should change of
> internal programs in
> Stata it should be in the other direction. I.e. having more
> functions to
> actually regard missing as missing. The internal storage of
> missing values
> as a very large number is irrelevant to users, in particular to new
> inexperienced colleagues. A technicality like this is part
> of the reason
> why some clinicians consider Stata a very difficult program
> to work with.
> (The second is the lack of quest system to handle all
> situations, and the
> third the difficulty of getting quick from results to
> tables in publications).
>
> So : Stata policy should be: If we can at all implement
> handling of missing
> as such without the user having to write strange sentences
> like (......if
> v1 != .) then do it. Not the other way around - in
> particular not for
> commands or functions which behaved sensibly before.
>
> Other stats programs such as SPSS (and SAS ?? I am not
> sure) can have
> several values defined as missing value. In some situations
> we wish to
> separate missing from irrelevant and this is currently not
> possible with
> the STATA dataformat. So add more than one value to the development
> strategy, please.

In the specific case of -egen, cut()-, a change
which was made (correctly, as I understand the general issue) had
a side-effect that missing values would be mapped to non-missing
values. Users who spoke on this on Statalist (Michael Hills,
Marcello Pagano, and Jens) all argued strongly that _that_ was
a mistake and Jean Marie Linhart signalled that Stata Corp
will fix it. The whole matter was aired and resolved within
a day, and we know that the update will follow quickly:
all sounds very healthy to me. Name software vendors, statistical or
otherwise, who perform comparably.

But Jens airs some more general issues. I think that his principle

> Changing behaviour in relation to well known and widely
> used functions is a "no-no" in my opinion.

is exactly Stata philosophy, so there wouldn't seem much room
for discussion. However, with all the care in the world there
is a need for bug fixes, some of which unfortunately trigger
the need for other bug fixes. Perfectly stable (meaning
unchanged, not reliable) software is only possible so long as bugs
remain unfixed.

However, the advice he gives

> Disregard automatic updates
> for crucial parts until you are dissatisfied with the old
> one. Better having an old and somewhat simpler version than a new one
> with uncertainty.

is rather puzzling to me. One can easily disregard official
updates just by not issuing -update-, but the balance between
getting bugs fixed and getting new features, as compared with
very occasionally having Stata change to something you don't
like, seems to me overwhelmingly in favour of updating. I would always
recommend an -update-. Of course, you can -update-, keep
track of changes by reading help on -whatsnew- and then decide
what you want to use, and what you want to program for yourself.
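
As a rough sketch of that workflow (assuming a Stata installation with
Internet access; the subcommands that -update- offers can differ
between releases, so check -help update- first):

	* compare what is installed with what is available
	update query
	* fetch and install the official updates
	update all
	* then read what changed before deciding what to use
	help whatsnew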

On missing values: many Stata users have asked for different kinds
of missing values at user meetings and in threads on Statalist,
and I'm sure they would be a major feature. Whether such
different ways of handling missing values would make Stata easier
to understand for "clinicians" and others, I don't know.
Somehow I doubt it.

More generally, we can all agree, I trust, that statistical
software must have a concept of numeric missing, which to
the user means in general "not known". Whatever the internals
of how such missings are treated, users just must understand
whether missing is regarded as "arbitrarily large and positive"
or "arbitrarily large and negative" for at least one purpose:
understanding what happens when you -sort- on a numeric
variable.
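
A quick way to check this for yourself, as a minimal sketch assuming
the auto data shipped with Stata (loaded here with -sysuse-, which may
not exist in older releases, where -use auto- plays the same role):

	sysuse auto, clear
	* numeric missing sorts after every non-missing value
	sort rep78
	* so any observations with rep78 missing are at the end
	list make rep78 in -5/l

In Stata, then, missing behaves as "arbitrarily large and positive"
for the purposes of -sort-.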

Actually for most purposes, missings in Stata are just ignored,
as they should be. Only occasionally are you likely
to be bitten, namely when Stata treats missing
as literally as possible. Such bites arise mostly (or perhaps
even entirely) from forgetting that

	if varname > #

includes varname being numeric missing, because Stata treats numeric
missing as larger than any non-missing number. Thus with the auto data

	if rep78 > 3

includes rep78 being missing. Rightly or wrongly, Stata went for the
inclusive approach.
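
To see the difference, here is a minimal sketch assuming the auto data
(the -sysuse- line is an assumption about how the data are loaded; in
releases that have several missing codes, -rep78 < .- is the more
general test):

	sysuse auto, clear
	* counts rep78 of 4 or 5 and also observations with rep78 missing
	count if rep78 > 3
	* excludes the missings explicitly
	count if rep78 > 3 & rep78 != .

The two counts differ by exactly the number of observations for which
rep78 is missing.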

Now consider the options, hypothetically:

1. After years of this, Stata Corp might announce this was
wrong. "if rep78 > 3" henceforth will _not_ include observations
for which rep78 is numeric missing. Some users might breathe
a sigh of relief, but there would be others who might argue
that such fundamentals should not be changed like this, even
with (perhaps especially with!) version control or whatever.

2. Some new syntax is introduced which lets you specify
easily "greater than 3, and I don't include missing". (-inrange()-
arguably already lets you do this.) This doesn't break
existing syntax, but it's one extra thing to learn about.
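
As a hedged illustration of the -inrange()- route, assuming that
-inrange()- returns 0 whenever its first argument is missing (worth
confirming in the help on functions for your release), and using the
fact that rep78 in the auto data takes integer values from 1 to 5:

	sysuse auto, clear
	* rep78 of 4 or 5 only; observations with rep78 missing are left out
	count if inrange(rep78, 4, 5)

This should give the same count as -count if rep78 > 3 & rep78 != .-.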

The trade-off between adding features (which everyone wants,
including occasional users) and keeping Stata as simple as
possible (which everyone wants, especially occasional users)
is tricky -- in a strict sense, perhaps even insoluble.

Nick
n.j.cox@durham.ac.uk



