Stata The Stata listserver

st: RE: How SHOULD cut behave

From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: How SHOULD cut behave
Date   Fri, 9 Aug 2002 15:00:26 +0100

Michael Hills

> After the flurry of crossing posts on this topic, finally put to bed
> by Bill Gould's very clear reply, perhaps it is worth airing the
> question of how cut SHOULD behave.
> In the original version, the result of tabulating newvar after
> egen newvar=cut(oldvar),at(25(5)45)
> was
> 25-
> 30-
> 35-
> 40-
> and as I understand it, the complaint was that the numbers
> 25,30,35,40,45 are described as left hand end-points so
> that strictly
> the output of tabulate should be
> 25-
> 30-
> 35-
> 40-
> 45-
> in which the last group contains all non-missing values of var. I
> confess that I don't like this, as I would have to exclude 45- from
> all following work with newvar. Also, why is there not a <25 group,
> which you might expect if there is a >45 group?
> Perhaps (as I think Jens suggested) the output
> 25-
> 30-
> 35-
> 40-45
> would satisfy all parties. Only observations in [25,45) are included
> and 45 is a not-included right-hand end. All observations outside
> [25,45) are coded as missing on newvar.
> The output
> [25-30)
> [30-35)
> [35-40)
> [40-45)
> would be even better, but the mathematician's convention
> that [25-30)
> includes 25 but not 30 is not recognized in medicine and probably
> not in economics either.

Michael's question is, rightly, how SHOULD -cut()- behave?
My starting point is not any version of the code, but the present
documentation at [R] egen, p.412. That's the most accessible
discussion, although incomplete.

I like his last list of labels, using [,) notation, even as a
lower form of life than those mentioned, a geographer.

Back-tracking a little: I haven't used -egen, cut()- very much, but
when I did, I found it confusing that, as implemented, the argument
of -at()- consists of left-hand ends, except for the last element,
which is a right-hand end.

Almost always, if I want to classify a continuous variable
I want to classify all values, it being understood that missings
are ignored (unless there is some reason to treat them specially,
for which I will make up my own rule, and use it explicitly).

If I don't want to classify all values, there is a standard Stata
way of expressing that desire, using -if-. I exclude any interval
not desired using -if-.
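
As a minimal sketch of that idiom, assuming the auto data shipped
with Stata and the cutpoints 15(5)40 chosen purely for illustration,
the restriction can be spelled out with -if- rather than left
implicit in -at()-:

```stata
sysuse auto, clear
* classify only mpg in [15, 40); everything else is left missing on purpose
egen cmpg = cut(mpg) if mpg >= 15 & mpg < 40, at(15(5)40)
tabulate cmpg, missing
```

With -cut()- as it stands, the -if- duplicates what -at()- already
does; the point is only that the exclusion is then stated by the
user, not imposed by the function.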

To fix ideas, consider -mpg- in the auto data, which varies
from 12 to 41.

If with the old behaviour of -egen, cut()- I go

egen cmpg = cut(mpg), at(10(5)40)

I find that the top value of 41 is mapped to missing. I can
subvert that by deliberately giving an upper limit which
is never used,

egen cmpg = cut(mpg), at(10(5)45)

which reminds me of all those movies in which the protagonist
can get past the guard dogs unscathed -- because an extra piece of
meat has been taken along for the purpose. That is, -cut()-
grabs the 45, and uses the next lower limit, so that 41 is mapped
to 40 as I originally intended.
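
The two calls can be compared side by side. This is a sketch that
assumes the auto data and the behaviour described above; the variable
names cmpg40 and cmpg45 are made up for the comparison.

```stata
sysuse auto, clear
egen cmpg40 = cut(mpg), at(10(5)40)   // last element 40 acts as a right-hand end
egen cmpg45 = cut(mpg), at(10(5)45)   // 45 is the extra, never-used upper limit
* show only the observations on which the two codings differ
list make mpg cmpg40 cmpg45 if cmpg40 != cmpg45
```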

To put it another way, if I specify left-hand ends of intervals,
then by far the simplest way of implementing that, it seems to me,
is that the highest such left-hand end specified is (if needed) the
start of an open interval containing that value and all higher
(again, not including missing).
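
A rough sketch of that rule, written with -generate- rather than
-egen, cut()-, and assuming the auto data with left-hand ends
10(5)40. It relies on the cutpoints being equally spaced, and it
illustrates the proposal only, not how -cut()- is or was coded.

```stata
sysuse auto, clear
* equal-width shortcut: round down to a multiple of 5, then cap at the
* highest left-hand end so that 40 and everything above form one class
generate cmpg = min(5 * floor(mpg / 5), 40) if mpg >= 10 & !missing(mpg)
tabulate cmpg, missing
```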

There is the question of what happens below the lowest left-hand end.

If I go

egen cmpg = cut(mpg), at(15(5)45)

values below 15 are also mapped to missing. The attitude is
tough: "You didn't say what you wanted doing with values
below 15, so I don't know what to do with them." This could be
regarded as fit punishment, or a suitable reply, that is,
the user got what was asked for, no more, no less.

On the other hand, I think it would be friendlier behaviour
to get another open interval, up to but not including
the lowest left-hand end specified.

I'm more worried by what -cut()- did at the top end, but
I think there is a strong case for consistent behaviour.

To put it more broadly, if I told someone to classify
a variable using the cutpoints 15(5)40, I wouldn't want them
to ignore values outside the range specified. That
should happen only if I spell that out. It should not
be the default.
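
A sketch of such a default for the cutpoints 15(5)40, assuming the
auto data. The code 10 for the class below 15 is an arbitrary choice
made here for illustration; nothing in -cut()- defines it.

```stata
sysuse auto, clear
generate cmpg = .
* open interval below the lowest left-hand end (arbitrary code 10)
replace cmpg = 10 if mpg < 15
* each pass assigns the left-hand end to every value at or above it,
* so a value ends up with the highest left-hand end not exceeding it
forvalues c = 15(5)40 {
    replace cmpg = `c' if mpg >= `c' & !missing(mpg)
}
tabulate cmpg, missing
```

Every non-missing value is classified; anyone wanting only
[15,40) can add an -if- qualifier.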

[email protected]
