
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: are there any statistics rules that I can apply to separate numbers into groups?
Date: Wed, 11 Mar 2009 11:21:52 -0000

. findit group1d

points to a program in this territory on SSC.

Thanks to Kyle Hood for reminding me that a version of this problem arises in choosing classes or bins for choropleth or patch maps. Long-term Stata user Ian S. Evans wrote a review of that territory that is still useful:

Ian S. Evans. 1977. The Selection of Class Intervals. Transactions of the Institute of British Geographers 2: 98-124.

It will be accessible to many readers (but not all) through JSTOR.

I must look again at how Jenks implemented his own least-squares criterion, but independently of this work in cartography, the problem has arisen in mainstream statistics. I suspect that Jenks' method would have had fewer users if he had used a more candid term such as "fortuitous breaks".

The help for -group1d- gives a rather detailed discussion with documentation showing that the problem goes back to 1958 at least, but I won't repeat that here.

-group1d- has a habit of picking out moderate outliers as singleton groups, but then that is hardly surprising given its least-squares criterion. I've been echoing Hartigan's 1975 comment intermittently over the last 30 years that least first powers (L_1 norm) is an alternative without ever implementing it.

Although Ada supplies her series in some jumbled order I am presuming she wants breaks in the distribution, i.e. to group the ordered values.

I see 49 values in her example. Reading those in

. sort var1
. group1d var1, max(7)

Partitions of 49 data up to 7 groups

1 group: sum of squares 8604.04

  Group  Size           First              Last     Mean      SD
      1    49     1   92.2135      49   144.228   112.04   13.25

2 groups: sum of squares 3059.49

  Group  Size           First              Last     Mean      SD
      2    33    17    108.97      49   144.228   119.44    9.03
      1    16     1   92.2135      16    107.54    96.76    4.80

3 groups: sum of squares 1073.41

  Group  Size           First              Last     Mean      SD
      3    10    40   124.565      49   144.228   130.97    6.70
      2    24    16    107.54      39   121.712   114.14    3.97
      1    15     1   92.2135      15   104.744    96.04    4.04

4 groups: sum of squares 524.92

  Group  Size           First              Last     Mean      SD
      4     4    46   133.568      49   144.228   138.39    4.33
      3    14    32   116.865      45   127.885   122.00    3.79
      2    19    13   102.857      31   115.641   110.48    3.51
      1    12     1   92.2135      12   95.5293    94.09    1.11

5 groups: sum of squares 309.08

  Group  Size           First              Last     Mean      SD
      5     4    46   133.568      49   144.228   138.39    4.33
      4     9    37    120.04      45   127.885   124.33    2.61
      3    14    23   112.013      36   119.072   114.95    2.37
      2    10    13   102.857      22   111.134   107.88    2.82
      1    12     1   92.2135      12   95.5293    94.09    1.11

6 groups: sum of squares 185.01

  Group  Size           First              Last     Mean      SD
      6     4    46   133.568      49   144.228   138.39    4.33
      5     6    40   124.565      45   127.885   126.03    1.15
      4     9    31   115.641      39   121.712   118.61    1.91
      3    14    17    108.97      30    114.67   111.74    1.74
      2     4    13   102.857      16    107.54   104.77    1.73
      1    12     1   92.2135      12   95.5293    94.09    1.11

7 groups: sum of squares 116.89

  Group  Size           First              Last     Mean      SD
      7     2    48   140.798      49   144.228   142.51    1.72
      6     2    46   133.568      47   134.952   134.26    0.69
      5     6    40   124.565      45   127.885   126.03    1.15
      4     9    31   115.641      39   121.712   118.61    1.91
      3    14    17    108.97      30    114.67   111.74    1.74
      2     4    13   102.857      16    107.54   104.77    1.73
      1    12     1   92.2135      12   95.5293    94.09    1.11

  Groups   Sums of squares
       1           8604.04
       2           3059.49
       3           1073.41
       4            524.92
       5            309.08
       6            185.01
       7            116.89

It is vital to check graphically that the groups (breaks) make sense. -qplot- from SJ is especially useful here.

. qplot var1, rank xli(12.5 22.5 36.5 45.5)

The graph shows that in this case two of the groups are fairly distinct, but the other subdivisions seem less convincing.
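The five-group partition can also be checked by hand. What follows is only a minimal sketch, not anything that -group1d- itself does: it assumes var1 holds the 49 weights, it takes the rank breaks from the -qplot- call (12.5, 22.5, 36.5, 45.5), and group5, gmean and sqdev are hypothetical variable names chosen for illustration.

```stata
* Assumption: var1 holds the 49 weights with no missing values.
* group5, gmean and sqdev are illustrative names, not created by -group1d-.
sort var1

* After sorting, _n is the rank; cut at ranks 12.5, 22.5, 36.5 and 45.5
generate byte group5 = 1 + (_n > 12) + (_n > 22) + (_n > 36) + (_n > 45)

* Size, range, mean and SD within each group
tabstat var1, by(group5) statistics(n min max mean sd)

* Within-group sum of squared deviations from the group means
bysort group5 (var1): egen double gmean = mean(var1)
generate double sqdev = (var1 - gmean)^2
quietly summarize sqdev, meanonly
display "within-group sum of squares = " %9.2f r(sum)
```

If the data are the same 49 values, the displayed total should agree with the 309.08 reported for 5 groups, and the -tabstat- table should match the corresponding rows of the listing.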
Nick
n.j.cox@durham.ac.uk

Ada Ma

Thank you to both Partha Deb and Kyle Hood for providing me with some very promising looking leads to attempt.

On Wed, Mar 11, 2009 at 7:11 AM, Kyle K. Hood <kyle.hood@yale.edu> wrote:
> In mapping, univariate classification schemes are used to group features
> together. An example is Jenks' natural breaks, which simply defines k-1
> cutoffs to minimize within-group sums of square deviations from group means.
> Unfortunately,
>
> . findit jenks
>
> produces nothing. However, there is information on the web regarding how to
> compute these cutoffs (just google it). I'm not sure how closely this
> method relates to cluster analysis and finite mixture models.

Partha Deb wrote:
>> Although one can never be sure what's in someone else's mind, I suspect
>> you are looking for cluster analysis. -help cluster- . Finite mixture
>> models may also be of interest. -findit fmm- . See
>> http://users.ox.ac.uk/~polf0050/ISS%20Lecture%208.pdf for a set of slides by
>> Stephen Fisher that has an introduction to Cluster analysis and finite
>> mixture models.

Ada Ma wrote:
>>> Let's say I have 50 packets of crisps of various weights and I would
>>> like to separate these 50 packets of crisps into five groups based on
>>> their weights in grams, as follows:
>>>
>>> 108.9702
>>> 111.1337
>>> 112.5217
>>> 112.6697
>>> 112.9962
>>> 114.0323
>>> 114.6699
>>> 116.8646
>>> 119.0719
>>> 124.5645
>>> 124.691
>>> 126.4943
>>> 126.5528
>>> 133.5675
>>> 134.9519
>>> 140.7979
>>> 144.228
>>> 102.8566
>>> 103.9373
>>> 104.7436
>>> 107.5397
>>> 109.4443
>>> 109.7089
>>> 110.395
>>> 112.1248
>>> 113.6032
>>> 115.6405
>>> 117.1919
>>> 120.0395
>>> 121.0714
>>> 121.7119
>>> 110.1116
>>> 112.0128
>>> 117.6563
>>> 118.2418
>>> 126.0027
>>> 127.8855
>>> 92.21352
>>> 92.45715
>>> 92.953
>>> 93.01508
>>> 94.05335
>>> 94.27259
>>> 94.38242
>>> 94.72507
>>> 94.83315
>>> 95.25914
>>> 95.37813
>>> 95.52933
>>>
>>> I don't want to separate them into five equally sized groups. I want
>>> to separate the packets into groups so that the group members are most
>>> similar to one another. I am looking for a method (or methods?) to
>>> achieve this end but I don't know where to start. If you can think of
>>> any suggestion please fire away and I'd be most grateful!
>>>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:
- **Re: st: are there any statistics rules that I can apply to separate numbers into groups?** *From:* Ada Ma <heu034@googlemail.com>

**References**:
- **st: are there any statistics rules that I can apply to separate numbers into groups?** *From:* Ada Ma <heu034@googlemail.com>
- **Re: st: are there any statistics rules that I can apply to separate numbers into groups?** *From:* Partha Deb <partha.deb@hunter.cuny.edu>
- **Re: st: are there any statistics rules that I can apply to separate numbers into groups?** *From:* "Kyle K. Hood" <kyle.hood@yale.edu>
- **Re: st: are there any statistics rules that I can apply to separate numbers into groups?** *From:* Ada Ma <heu034@googlemail.com>
