
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: How SHOULD cut behave
Date: Fri, 9 Aug 2002 15:00:26 +0100

Michael Hills

> After the flurry of crossing posts on this topic, finally put to bed
> by Bill Gould's very clear reply, perhaps it is worth airing the
> question of how cut SHOULD behave.
>
> In the original version, the result of tabulating newvar after
>
> egen newvar=cut(oldvar),at(25(5)45)
>
> was
>
> 25-
> 30-
> 35-
> 40-
>
> and as I understand it, the complaint was that the numbers
> 25,30,35,40,45 are described as left hand end-points so that strictly
> the output of tabulate should be
>
> 25-
> 30-
> 35-
> 40-
> 45-
>
> in which the last group contains all non-missing values of var. I
> confess that I don't like this, as I would have to exclude 45- from
> all following work with newvar. Also, why is there not a <25 group,
> which you might expect if there is a >45 group?
>
> Perhaps (as I think Jens suggested) the output
>
> 25-
> 30-
> 35-
> 40-45
>
> would satisfy all parties. Only observations in [25,45) are included
> and 45 is a not-included right-hand end. All observations outside
> [25,45) are coded as missing on newvar.
>
> The output
>
> [25-30)
> [30-35)
> [35-40)
> [40-45)
>
> would be even better, but the mathematician's convention that [25-30)
> includes 25 but not 30 is not recognized in medicine and probably
> not in economics either.

Michael's question is, rightly, how SHOULD -cut()- behave?

My starting point is not any version of the code, but the present
documentation at [R] egen, p.412. That's the most accessible
discussion, although incomplete.

I like his last list of labels, using [,) notation, even as a lower
form of life than those mentioned, a geographer.

But back-tracking, I haven't used -egen, cut()- very much, but when
I did I found it confusing that as implemented the argument of -at()-
consists of left-hand ends, except for the last element, which is
a right-hand end.
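To make the [,) labelling concrete, here is a minimal sketch in Python rather than Stata. It is not the -egen- code. It only models the rule as described above, in which every element of -at()- is a left-hand end except the last, which is a right-hand end. The function names `interval_labels` and `label_of` are hypothetical, and `None` stands in for Stata's missing.

```python
from bisect import bisect_right

def interval_labels(at):
    """Half-open labels for consecutive cutpoints, e.g. [25,30)."""
    return [f"[{lo},{hi})" for lo, hi in zip(at, at[1:])]

def label_of(x, at):
    """Label for x under the rule described in the post.

    Assumption: values outside [at[0], at[-1]) and missing values
    (modelled here as None) get no label.
    """
    if x is None or x < at[0] or x >= at[-1]:
        return None
    # bisect_right counts the cutpoints <= x, so the interval index is one less.
    return interval_labels(at)[bisect_right(at, x) - 1]

at = list(range(25, 46, 5))      # the cutpoints 25(5)45
print(interval_labels(at))       # four intervals, the last being [40,45)
print(label_of(30, at))          # 30 belongs to [30,35), not [25,30)
print(label_of(45, at))          # None: 45 is a not-included right-hand end
```

Under this reading 45 never starts an interval of its own, which is exactly the asymmetry in -at()- that I found confusing.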
Almost always, if I want to classify a continuous variable
I want to classify all values, it being understood that missings
are ignored (unless there is some reason to treat them specially,
for which I will make up my own rule, and use it explicitly).

If I don't want to classify all values, there is a standard
Stata way of expressing that desire, using -if-. I exclude
any interval not desired using -if-.

To fix ideas, consider -mpg- in the auto data, which
varies from 12 to 41. If with the old behaviour of
-egen, cut()- I go

    egen cmpg = cut(mpg), at(10(5)40)

I find that the top value of 41 is mapped to missing.
I can subvert that by deliberately giving an upper
limit which is never used,

    egen cmpg = cut(mpg), at(10(5)45)

which reminds me of all those movies in which the protagonist
can get past the guard dogs unscathed -- because an extra piece
of meat has been taken along for the purpose. That is,
-cut()- grabs the 45, and uses the next lower limit,
so that 41 is mapped to 40 as I originally intended.

To put it another way, if I specify left-hand ends of
intervals, then by far the simplest way of implementing
that, it seems to me, is that the highest such left-hand end
specified is (if needed) the start of an open interval
containing that value and all higher (again, not including
missing).

There is the question of what happens below the lowest
left-hand end. If I go

    egen cmpg = cut(mpg), at(15(5)45)

values below 15 are also mapped to missing. The attitude
is tough: "You didn't say what you wanted doing
with values below 15, so I don't know what to do
with them.". This could be regarded as fit punishment,
or a suitable reply, that is, the user got what was asked for,
no more, no less. On the other hand, I think it would be
friendlier behaviour to get another open interval,
up to but not including 15. I'm more worried by what
-cut()- did at the top end, but I think there is
a strong case for consistent behaviour.
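The difference between the two rules can be sketched as follows, again in Python rather than Stata and purely as a model of the behaviour discussed here, not of the -egen- source. `cut_old` and `cut_proposed` are hypothetical names, `None` stands in for missing, and the code returned for values below the lowest cutpoint (`below`, defaulting to minus infinity) is an assumption, since nothing above says what that code should be.

```python
from bisect import bisect_right

def cut_old(x, at):
    """Old behaviour as described: the last element of at is a right-hand end."""
    if x is None or x < at[0] or x >= at[-1]:
        return None                      # outside [at[0], at[-1]) -> missing
    return at[bisect_right(at, x) - 1]   # largest cutpoint <= x

def cut_proposed(x, at, below=float("-inf")):
    """Suggested behaviour: every element of at is a left-hand end.

    Assumption: `below` is a placeholder code for the open interval
    under the lowest cutpoint.
    """
    if x is None:
        return None                      # missing stays missing
    i = bisect_right(at, x)
    return at[i - 1] if i > 0 else below

# -mpg- in the auto data runs from 12 to 41.
print(cut_old(41, list(range(10, 41, 5))))       # None: at(10(5)40) loses 41
print(cut_old(41, list(range(10, 46, 5))))       # 40: the extra 45 is the piece of meat
print(cut_old(12, list(range(15, 46, 5))))       # None: at(15(5)45) loses 12
print(cut_proposed(41, list(range(10, 41, 5))))  # 40: no dummy upper limit needed
print(cut_proposed(12, list(range(15, 41, 5))))  # -inf: placeholder for "below 15"
```

With the second rule, anyone who does want values outside the range left unclassified would say so explicitly, which in Stata is what -if- is for.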
To put it more broadly, if I told someone to classify
a variable using the cutpoints 15(5)40, I wouldn't
want them to ignore any values outside the range
specified. That should happen only if I spell that
out. It should not be the default.

Nick
n.j.cox@durham.ac.uk

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

References:
  st: How SHOULD cut behave
  From: Michael Hills <mhills@blueyonder.co.uk>
