Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


st: RE: Cut function


From   Nick Cox <[email protected]>
To   "'[email protected]'" <[email protected]>
Subject   st: RE: Cut function
Date   Fri, 10 Dec 2010 18:43:24 +0000

This was discussed on the list some while back, in 2002.

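It's not officially considered a bug: quite the converse, it is an intended consequence of how -egen, cut()- is defined. Observations greater than or equal to the last value in at() are assigned missing, so the observed maximum falls outside every class whenever it is used as the last break.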
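
A minimal sketch of that behaviour, using the auto data from the question. The break values 3000, 9000 and 16000 and the variable names cls_a and cls_b are illustrative choices, not anything from the original posting.

```stata
sysuse auto, clear

* Last break equals the observed maximum (15906):
* the observation at the maximum is not assigned to any class.
egen byte cls_a = cut(price), at(3000,9000,15906) icodes

* Last break lies above the observed maximum:
* every observation is assigned to a class.
egen byte cls_b = cut(price), at(3000,9000,16000) icodes

count if missing(cls_a)
count if missing(cls_b)
```

Pushing the last break beyond the maximum is one way to keep the top observation while still using -egen, cut()-.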

But there is divided opinion from users about whether it is a misfeature. 

See for example http://www.stata.com/statalist/archive/2002-08/msg00151.html for a program author's view. 

My own opinion is that 

1. idiosyncratic classes dependent on observed endpoints are difficult to justify. 

2. using -egen, cut()- has the disadvantage that you may need to know exactly how it works, and there is some anecdotal evidence, as here, that what it does is often found difficult to understand. 

If I bin, I do it to classes defined by -floor()- or -ceil()- which then 

1. are defined by a single line of Stata code 

2. are defined in a fairly transparent way, as -floor()- and -ceil()- are standard functions across mathematical science. 

3. have nice round limits (a secondary but often desirable feature).  
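
As an illustration of the single-line approach, here is a sketch on the auto data. The class width of 2000 and the variable name price_class are assumptions made for the example only.

```stata
sysuse auto, clear

* Each price is mapped to the lower limit of its class:
* [2000, 4000), [4000, 6000), and so on.
generate price_class = 2000 * floor(price / 2000)

tabulate price_class
```

With -ceil()- in place of -floor()- the classes are labelled by their upper limits instead, and each class includes its upper limit rather than its lower one.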

An intersecting issue is that there is likely to be some loss of precision in storing values in locals. Testing for equality with non-integers is _always_ precarious in any case, for reasons often discussed on this list. But what you get out of a local may not be what you put in! 
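
One way to sidestep both concerns, sketched under the assumption that nine equal-width classes between the observed minimum and maximum are still wanted: hold the limits in scalars, which keep full double precision, and assign classes with -floor()-. The names pmin, pstep and price_class9 are hypothetical.

```stata
sysuse auto, clear
summarize price, meanonly

* Scalars hold the numeric results themselves, not a string of digits.
scalar pmin  = r(min)
scalar pstep = (r(max) - r(min)) / 9

* Classes coded 0 to 8; min() places the observed maximum in the top class
* instead of leaving it unclassified.
generate byte price_class9 = min(floor((price - scalar(pmin)) / scalar(pstep)), 8)

tabulate price_class9, missing
```

No test for equality with a non-integer is needed here, and the observed maximum cannot drop out.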

Nick 
[email protected] 

Albert Lee

Just want to see if this has happened to anyone else, and if Stata has
an explanation.  I was trying to bin a continuous variable into fixed
intervals.  According to Stata documentation, this function

egen price_incrB=cut(price), at(`min'(`step')`max') icodes
(1 missing value generated)

For some reason, one missing value is created.  It turns out that the max
is not covered by any increment from the egen cut function.  Is this due to
rounding?  Is this a bug?

I would appreciate anyone's insight.

Code fragments are included below.

. sysuse auto, clear
(1978 Automobile Data)

.
. sum price, d

                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188

.
. local max=r(max)

.
. local min=r(min)

.
. local step=(`max'-`min')/9

.
. disp `step'
1401.6667

.
. egen price_incrB=cut(price), at(`min'(`step')`max') icodes
(1 missing value generated)

.
. tab price_incrB, mi

price_incrB |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         30       40.54       40.54
          1 |         21       28.38       68.92
          2 |          8       10.81       79.73
          3 |          3        4.05       83.78
          4 |          2        2.70       86.49
          5 |          4        5.41       91.89
          6 |          2        2.70       94.59
          7 |          3        4.05       98.65
          . |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00

.
. list price if price_incrB==.

     +--------+
     |  price |
     |--------|
 13. | 15,906 |
     +--------+

