
From: Ada Ma <heu034@googlemail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: are there any statistics rules that I can apply to separate numbers into groups?
Date: Thu, 12 Mar 2009 10:20:41 +0000

Thanks to Nick for introducing me to this wonderful command -group1d-. It's exactly what I was looking for.

I have some further questions, which I hope someone can help me to understand. I was also playing around with the -cluster kmeans- command and found that -group1d- generates the same groupings as -cluster kmeans- with the option -measure(L2squared)- applied. I then compared the results of -cluster kmeans- with and without the -measure(L2squared)- option specified. The resulting groupings are different. I don't really understand why this should be the case for univariate clustering, because when I typed:

help measure_option

(note the underscore between the words measure and option; without the underscore a different help file will show up)

it is explained that the default option calculates the grouping by minimising the Euclidean distance / Minkowski distance metric with argument 2:

sqrt(sum((x_ia - x_ja)^2))

But when the option -measure(L2squared)- is specified, grouping is assigned by minimising the square of the Euclidean distance / Minkowski distance metric with argument 2:

sum((x_ia - x_ja)^2)

Here is some output generated using the same 49 observations:

. cluster kmeans var1, k(4) generate(euclid)
cluster name: _clus_5

. cluster kmeans var1, k(4) generate(euclidsq) measure(L2squared)
cluster name: _clus_1

. tab euclid euclidsq

           |                  euclidsq
    euclid |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         1 |        10          0          0          0 |        10
         2 |         0          0         12          0 |        12
         3 |         0          4          0          6 |        10
         4 |         9          0          0          8 |        17
-----------+--------------------------------------------+----------
     Total |        19          4         12         14 |        49

. bys euclid: egen m_euclid=mean(var1)

. bys euclidsq: egen m_euclidsq=mean(var1)

. egen tot1euclid=total((var1-m_euclid)^2)

. egen tot1euclidsq=total((var1-m_euclidsq)^2)

. sum tot*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
  tot1euclid |        49    712.2434           0   712.2434   712.2434
tot1euclidsq |        49    524.9169           0   524.9169   524.9169

. di sqrt(712.2434 )
26.687889

. di sqrt( 524.9169 )
22.911065

The grouping generated with the option -measure(L2squared)- applied has the smaller within-group sum of squares (524.92 against 712.24), so it is superior to the one generated without it. This shouldn't be the case for univariate clustering, or should it? Have I missed something important?

Thank you once again!!

Ada

On Wed, Mar 11, 2009 at 11:21 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> <>
>
> . findit group1d
>
> points to a program in this territory on SSC.
>
> Thanks to Kyle Hood for reminding me that a version of this problem arises in choosing classes or bins for choropleth or patch maps.
>
> Long-term Stata user Ian S. Evans wrote a review of that territory that is still useful:
>
> Ian S. Evans. 1977.
> The Selection of Class Intervals.
> Transactions of the Institute of British Geographers 2: 98-124.
>
> It will be accessible to many readers (but not all) through JSTOR.
>
> I must look again at how Jenks implemented his own least-squares criterion, but independently of this work in cartography, the problem has arisen in mainstream statistics. I suspect that Jenks' method would have had fewer users if he had used a more candid term such as "fortuitous breaks".
>
> The help for -group1d- gives a rather detailed discussion with documentation showing that the problem goes back to 1958 at least, but I won't repeat that here.
>
> -group1d- has a habit of picking out moderate outliers as singleton groups, but then that is hardly surprising given its least-squares criterion. I've been echoing Hartigan's 1975 comment intermittently over the last 30 years that least first powers (L_1 norm) is an alternative without ever implementing it.
>
> Although Ada supplies her series in some jumbled order I am presuming she wants breaks in the distribution, i.e. to group the ordered values.
>
> I see 49 values in her example.
> Reading those in
>
> . sort var1
>
> . group1d var1, max(7)
>
> Partitions of 49 data up to 7 groups
>
> 1 group: sum of squares 8604.04
>  Group   Size           First             Last     Mean      SD
>      1     49     1   92.2135     49   144.228   112.04   13.25
>
> 2 groups: sum of squares 3059.49
>  Group   Size           First             Last     Mean      SD
>      2     33    17    108.97     49   144.228   119.44    9.03
>      1     16     1   92.2135     16    107.54    96.76    4.80
>
> 3 groups: sum of squares 1073.41
>  Group   Size           First             Last     Mean      SD
>      3     10    40   124.565     49   144.228   130.97    6.70
>      2     24    16    107.54     39   121.712   114.14    3.97
>      1     15     1   92.2135     15   104.744    96.04    4.04
>
> 4 groups: sum of squares 524.92
>  Group   Size           First             Last     Mean      SD
>      4      4    46   133.568     49   144.228   138.39    4.33
>      3     14    32   116.865     45   127.885   122.00    3.79
>      2     19    13   102.857     31   115.641   110.48    3.51
>      1     12     1   92.2135     12   95.5293    94.09    1.11
>
> 5 groups: sum of squares 309.08
>  Group   Size           First             Last     Mean      SD
>      5      4    46   133.568     49   144.228   138.39    4.33
>      4      9    37    120.04     45   127.885   124.33    2.61
>      3     14    23   112.013     36   119.072   114.95    2.37
>      2     10    13   102.857     22   111.134   107.88    2.82
>      1     12     1   92.2135     12   95.5293    94.09    1.11
>
> 6 groups: sum of squares 185.01
>  Group   Size           First             Last     Mean      SD
>      6      4    46   133.568     49   144.228   138.39    4.33
>      5      6    40   124.565     45   127.885   126.03    1.15
>      4      9    31   115.641     39   121.712   118.61    1.91
>      3     14    17    108.97     30    114.67   111.74    1.74
>      2      4    13   102.857     16    107.54   104.77    1.73
>      1     12     1   92.2135     12   95.5293    94.09    1.11
>
> 7 groups: sum of squares 116.89
>  Group   Size           First             Last     Mean      SD
>      7      2    48   140.798     49   144.228   142.51    1.72
>      6      2    46   133.568     47   134.952   134.26    0.69
>      5      6    40   124.565     45   127.885   126.03    1.15
>      4      9    31   115.641     39   121.712   118.61    1.91
>      3     14    17    108.97     30    114.67   111.74    1.74
>      2      4    13   102.857     16    107.54   104.77    1.73
>      1     12     1   92.2135     12   95.5293    94.09    1.11
>
> Groups   Sums of squares
>      1           8604.04
>      2           3059.49
>      3           1073.41
>      4            524.92
>      5            309.08
>      6            185.01
>      7            116.89
>
> It is vital to check graphically that the groups (breaks) make sense. -qplot- from SJ is especially useful here.
>
> . qplot var1, rank xli(12.5 22.5 36.5 45.5)
>
> The graph shows that in this case two of the groups are fairly distinct, but the other subdivisions seem less convincing.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Ada Ma
>
> Thank you to both Partha Deb and Kyle Hood for providing me with some
> very promising looking leads to attempt.
>
> On Wed, Mar 11, 2009 at 7:11 AM, Kyle K. Hood <kyle.hood@yale.edu> wrote:
>
>> In mapping, univariate classification schemes are used to group features
>> together. An example is Jenks' natural breaks, which simply defines k-1
>> cutoffs to minimize within-group sums of square deviations from group means.
>> Unfortunately,
>>
>> . findit jenks
>>
>> produces nothing. However, there is information on the web regarding how to
>> compute these cutoffs (just google it). I'm not sure how closely this
>> method relates to cluster analysis and finite mixture models.
>
> Partha Deb wrote:
>
>>> Although one can never be sure what's in someone else's mind, I suspect
>>> you are looking for cluster analysis. -help cluster- . Finite mixture
>>> models may also be of interest. -findit fmm- . See
>>> http://users.ox.ac.uk/~polf0050/ISS%20Lecture%208.pdf for a set of slides by
>>> Stephen Fisher that has an introduction to Cluster analysis and finite
>>> mixture models.
>
> Ada Ma wrote:
>
>>>> Let's say I have 50 packets of crisps of various weights and I would
>>>> like to separate these 50 packets of crisps into five groups based on
>>>> their weights in grams, as follows:
>>>>
>>>> 108.9702
>>>> 111.1337
>>>> 112.5217
>>>> 112.6697
>>>> 112.9962
>>>> 114.0323
>>>> 114.6699
>>>> 116.8646
>>>> 119.0719
>>>> 124.5645
>>>> 124.691
>>>> 126.4943
>>>> 126.5528
>>>> 133.5675
>>>> 134.9519
>>>> 140.7979
>>>> 144.228
>>>> 102.8566
>>>> 103.9373
>>>> 104.7436
>>>> 107.5397
>>>> 109.4443
>>>> 109.7089
>>>> 110.395
>>>> 112.1248
>>>> 113.6032
>>>> 115.6405
>>>> 117.1919
>>>> 120.0395
>>>> 121.0714
>>>> 121.7119
>>>> 110.1116
>>>> 112.0128
>>>> 117.6563
>>>> 118.2418
>>>> 126.0027
>>>> 127.8855
>>>> 92.21352
>>>> 92.45715
>>>> 92.953
>>>> 93.01508
>>>> 94.05335
>>>> 94.27259
>>>> 94.38242
>>>> 94.72507
>>>> 94.83315
>>>> 95.25914
>>>> 95.37813
>>>> 95.52933
>>>>
>>>> I don't want to separate them into five equally sized groups. I want
>>>> to separate the packets into groups so that the group members are most
>>>> similar to one another. I am looking for a method (or methods?) to
>>>> achieve this end but I don't know where to start. If you can think of
>>>> any suggestion please fire away and I'd be most grateful!
>>>>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>

-- 
Ada Ma
Research Fellow
Health Economics Research Unit
University of Aberdeen, UK.
http://www.abdn.ac.uk/heru/
Tel: +44 (0) 1224 553863
Fax: +44 (0) 1224 550926

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
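
On the univariate question raised in the message above, the reasoning can be written out as a short derivation. This is only a sketch, not a description of how -cluster kmeans- is implemented: it assumes that each k-means step assigns an observation to the group whose mean is nearest under the chosen measure. The symbols x_i (an observation), m_g (the mean of group g) and k (the number of groups) are introduced here for illustration and do not come from the Stata help files.

```latex
% Assumption: observation x_i is assigned to the group g whose mean m_g is nearest.
% With a single variable, the L2 distance reduces to an absolute difference:
\[
  d_{\mathrm{L2}}(x_i, m_g) = \sqrt{(x_i - m_g)^2} = |x_i - m_g| ,
\]
% and the L2squared measure is its square:
\[
  d_{\mathrm{L2squared}}(x_i, m_g) = (x_i - m_g)^2 .
\]
% Squaring is strictly increasing on [0, \infty), so both measures pick the same nearest mean:
\[
  \arg\min_{g} |x_i - m_g| = \arg\min_{g} (x_i - m_g)^2 .
\]
% The within-group sum of squares used to compare the two groupings is
\[
  SS = \sum_{g=1}^{k} \sum_{i \in g} (x_i - m_g)^2 .
\]
```

Under that assumption the two measures would make the same assignment at every step when started from the same centers, so different final groupings (SS of 712.24 against 524.92 here) would have to arise from something else, for example different starting centers. That is a conjecture, not something checked on these data.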


