# RE: st: are there any statistics rules that I can apply to separate numbers into groups?

From: "Nick Cox"
Subject: RE: st: are there any statistics rules that I can apply to separate numbers into groups?
Date: Wed, 11 Mar 2009 11:21:52 -0000


. findit group1d

points to a program in this territory on SSC.

Thanks to Kyle Hood for reminding me that a version of this problem arises in choosing classes or bins for choropleth or patch maps.

Long-term Stata user Ian S. Evans wrote a review of that territory that is still useful:

Ian S. Evans. 1977.
The Selection of Class Intervals.
Transactions of the Institute of British Geographers 2: 98-124.

It will be accessible to many readers (but not all) through JSTOR.

I must look again at how Jenks implemented his own least-squares criterion, but independently of this work in cartography, the problem has arisen in mainstream statistics. I suspect that Jenks' method would have had fewer users if he had used a more candid term such as "fortuitous breaks".

The help for -group1d- gives a rather detailed discussion with documentation showing that the problem goes back to 1958 at least, but I won't repeat that here.

-group1d- has a habit of picking out moderate outliers as singleton groups, but that is hardly surprising given its least-squares criterion. Intermittently over the last 30 years I have been echoing Hartigan's 1975 comment that least first powers (L_1 norm) is an alternative, without ever implementing it.
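For reference, the two criteria can be written out as follows. This is a sketch in notation assumed here, not taken from the -group1d- help: the ordered values $x_{(1)} \le \dots \le x_{(n)}$ are split into $k$ contiguous groups $G_1, \dots, G_k$, $\bar{x}_j$ is the mean of group $j$, and $m_j$ is a centre for group $j$ (the median is the value that minimizes each inner sum of absolute deviations).

$$
\mathrm{SS}(k) = \sum_{j=1}^{k} \sum_{i \in G_j} \left( x_{(i)} - \bar{x}_j \right)^2,
\qquad
\mathrm{SA}(k) = \sum_{j=1}^{k} \sum_{i \in G_j} \left| x_{(i)} - m_j \right|
$$

Least squares chooses the breaks that minimize $\mathrm{SS}(k)$; the L_1 alternative would minimize $\mathrm{SA}(k)$ instead. Because deviations are squared in $\mathrm{SS}(k)$, a single distant value can contribute more than many moderate ones, which is consistent with outliers ending up in groups of their own.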

Although Ada supplies her series in some jumbled order, I am presuming she wants breaks in the distribution, i.e. to group the ordered values.

I see 49 values in her example (not the 50 mentioned). After reading those in:

. sort var1

. group1d var1, max(7)

Partitions of 49 data up to 7 groups

1 group:  sum of squares 8604.04
Group Size    First            Last           Mean      SD
1      49    1  92.2135      49  144.228   112.04   13.25

2 groups: sum of squares 3059.49
Group Size    First            Last           Mean      SD
2      33   17   108.97      49  144.228   119.44    9.03
1      16    1  92.2135      16   107.54    96.76    4.80

3 groups: sum of squares 1073.41
Group Size    First            Last           Mean      SD
3      10   40  124.565      49  144.228   130.97    6.70
2      24   16   107.54      39  121.712   114.14    3.97
1      15    1  92.2135      15  104.744    96.04    4.04

4 groups: sum of squares 524.92
Group Size    First            Last           Mean      SD
4       4   46  133.568      49  144.228   138.39    4.33
3      14   32  116.865      45  127.885   122.00    3.79
2      19   13  102.857      31  115.641   110.48    3.51
1      12    1  92.2135      12  95.5293    94.09    1.11

5 groups: sum of squares 309.08
Group Size    First            Last           Mean      SD
5       4   46  133.568      49  144.228   138.39    4.33
4       9   37   120.04      45  127.885   124.33    2.61
3      14   23  112.013      36  119.072   114.95    2.37
2      10   13  102.857      22  111.134   107.88    2.82
1      12    1  92.2135      12  95.5293    94.09    1.11

6 groups: sum of squares 185.01
Group Size    First            Last           Mean      SD
6       4   46  133.568      49  144.228   138.39    4.33
5       6   40  124.565      45  127.885   126.03    1.15
4       9   31  115.641      39  121.712   118.61    1.91
3      14   17   108.97      30   114.67   111.74    1.74
2       4   13  102.857      16   107.54   104.77    1.73
1      12    1  92.2135      12  95.5293    94.09    1.11

7 groups: sum of squares 116.89
Group Size    First            Last           Mean      SD
7       2   48  140.798      49  144.228   142.51    1.72
6       2   46  133.568      47  134.952   134.26    0.69
5       6   40  124.565      45  127.885   126.03    1.15
4       9   31  115.641      39  121.712   118.61    1.91
3      14   17   108.97      30   114.67   111.74    1.74
2       4   13  102.857      16   107.54   104.77    1.73
1      12    1  92.2135      12  95.5293    94.09    1.11

Groups     Sums of squares
1         8604.04
2         3059.49
3         1073.41
4          524.92
5          309.08
6          185.01
7          116.89
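
One way to read this table, offered as a sketch rather than as part of the -group1d- output: express each sum of squares as a proportion of the 1-group total, which shows how quickly the gains from extra groups tail off. The loop below is plain arithmetic on the figures above, written for a do-file; the local macro name `ss` is arbitrary.

```stata
* proportion of the 1-group sum of squares (8604.04)
* removed by solutions with 2, ..., 7 groups
foreach ss in 3059.49 1073.41 524.92 309.08 185.01 116.89 {
    display %6.3f (1 - `ss'/8604.04)
}
```

Two groups remove about 64% of the total and three about 88%, after which each extra group adds much less. This is a descriptive summary only, not a rule for choosing the number of groups.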

It is vital to check graphically that the groups (breaks) make sense. -qplot- from SJ is especially useful here.

. qplot var1, rank xli(12.5 22.5 36.5 45.5)

The graph shows that in this case two of the groups are fairly distinct, but the other subdivisions seem less convincing.
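
A possible numerical companion to the graph, as a sketch only: the 5-group solution above ends its groups at ranks 12, 22, 36 and 45, so a group variable can be built from the sort order and the groups summarized directly. The variable names `group`, `gmean` and `sqdev` are made up for illustration, and the code assumes that var1 holds the 49 values with no missing values.

```stata
* put the values in order so that _n is the rank
sort var1

* breaks after ranks 12, 22, 36 and 45, as in the 5-group solution
generate byte group = 1 + (_n > 12) + (_n > 22) + (_n > 36) + (_n > 45)
tabstat var1, by(group) statistics(n min max mean sd)

* within-group sum of squared deviations from group means
egen double gmean = mean(var1), by(group)
generate double sqdev = (var1 - gmean)^2
quietly summarize sqdev
display r(sum)
```

If the partition matches, the displayed total should agree with the 309.08 reported for 5 groups. The SDs from -tabstat- use divisor n - 1 and so will be slightly larger than those in the -group1d- table, which appear to use divisor n (for 1 group, 49 x 13.25^2 is about 8603, close to 8604.04).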

Nick
n.j.cox@durham.ac.uk

Thank you to both Partha Deb and Kyle Hood for providing me with some
very promising looking leads to attempt.

On Wed, Mar 11, 2009 at 7:11 AM, Kyle K. Hood <kyle.hood@yale.edu> wrote:

> In mapping, univariate classification schemes are used to group features
> together.  An example is Jenks' natural breaks, which simply defines k-1
> cutoffs to minimize within-group sums of square deviations from group means.
>  Unfortunately,
>
> . findit jenks
>
> produces nothing.  However, there is information on the web regarding how to
> compute these cutoffs (just google it).  I'm not sure how closely this
> method relates to cluster analysis and finite mixture models.

Partha Deb wrote:

>> Although one can never be sure what's in someone else's mind, I suspect
>> you are looking for cluster analysis. -help cluster- .  Finite mixture
>> models may also be of interest. -findit fmm- .  See
>> http://users.ox.ac.uk/~polf0050/ISS%20Lecture%208.pdf for a set of slides by
>> Stephen Fisher that has an introduction to Cluster analysis and finite
>> mixture models.

>>> Let's say I have 50 packets of crisps of various weights and I would
>>> like to separate these 50 packets of crisps into five groups based on
>>> their weights in grams, as follows:
>>>
>>> 108.9702
>>> 111.1337
>>> 112.5217
>>> 112.6697
>>> 112.9962
>>> 114.0323
>>> 114.6699
>>> 116.8646
>>> 119.0719
>>> 124.5645
>>> 124.691
>>> 126.4943
>>> 126.5528
>>> 133.5675
>>> 134.9519
>>> 140.7979
>>> 144.228
>>> 102.8566
>>> 103.9373
>>> 104.7436
>>> 107.5397
>>> 109.4443
>>> 109.7089
>>> 110.395
>>> 112.1248
>>> 113.6032
>>> 115.6405
>>> 117.1919
>>> 120.0395
>>> 121.0714
>>> 121.7119
>>> 110.1116
>>> 112.0128
>>> 117.6563
>>> 118.2418
>>> 126.0027
>>> 127.8855
>>> 92.21352
>>> 92.45715
>>> 92.953
>>> 93.01508
>>> 94.05335
>>> 94.27259
>>> 94.38242
>>> 94.72507
>>> 94.83315
>>> 95.25914
>>> 95.37813
>>> 95.52933
>>>
>>> I don't want to separate them into five equally sized groups.  I want
>>> to separate the packets into groups so that the group members are most
>>> similar to one another.  I am looking for a method (or methods?) to
>>> achieve this end but I don't know where to start.  If you can think of
>>> any suggestion please fire away and I'd be most grateful!
>>>
