# Re: st: are there any statistics rules that I can apply to separate numbers into groups?

From: Ada Ma | To: statalist@hsphsun2.harvard.edu | Subject: Re: st: are there any statistics rules that I can apply to separate numbers into groups? | Date: Thu, 12 Mar 2009 10:20:41 +0000

```
Thanks to Nick for introducing me to this wonderful command -group1d-.
It's exactly what I was looking for.

I have some further questions that I hope someone can help me
understand.  I was also playing around with the -cluster kmeans-
command and found that -group1d- generates the same groupings as
-cluster kmeans- with the option -measure(L2squared)- applied.

I then compared the results of -cluster kmeans- with and without the
-measure(L2squared)- option specified.  The resulting groupings are
different.  I don't really understand why this should be the case for
univariate clustering.  I typed:

help measure_option  (note the underscore between the words measure
and option; without the underscore a different help file shows up)

The help file explains that the default option calculates the grouping
by minimising the Euclidean distance (the Minkowski distance metric
with argument 2):

sqrt(sum((x_ia - x_ja)^2))

But when the option -measure(L2squared)- is specified, the grouping is
assigned by minimising the square of the Euclidean distance (again the
Minkowski distance metric with argument 2, squared):

sum((x_ia - x_ja)^2)

Here is some output generated using the same 49 observations:

. cluster kmeans var1, k(4) generate(euclid)
cluster name: _clus_5

. cluster kmeans var1, k(4) generate(euclidsq) measure(L2squared)
cluster name: _clus_1

. tab  euclid euclidsq

           |                  euclidsq
    euclid |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         1 |        10          0          0          0 |        10
         2 |         0          0         12          0 |        12
         3 |         0          4          0          6 |        10
         4 |         9          0          0          8 |        17
-----------+--------------------------------------------+----------
     Total |        19          4         12         14 |        49

. bys euclid: egen m_euclid=mean(var1)

. bys euclidsq: egen m_euclidsq=mean(var1)

. egen tot1euclid=total((var1-m_euclid)^2)

. egen tot1euclidsq=total((var1-m_euclidsq)^2)

. sum tot*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
  tot1euclid |        49    712.2434           0   712.2434   712.2434
tot1euclidsq |        49    524.9169           0   524.9169   524.9169

. di sqrt(712.2434 )
26.687889

. di sqrt( 524.9169  )
22.911065

The grouping generated with the option -measure(L2squared)- applied is
superior to the one generated without it: its within-group sum of
squares is 524.92 rather than 712.24.  This shouldn't be the case for
univariate clustering, or should it?  Have I missed something
important?

Thank you once again!!

On Wed, Mar 11, 2009 at 11:21 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> <>
>
> . findit group1d
>
> points to a program in this territory on SSC.
>
> Thanks to Kyle Hood for reminding me that a version of this problem arises in choosing classes or bins for choropleth or patch maps.
>
> Long-term Stata user Ian S. Evans wrote a review of that territory that is still useful:
>
> Ian S. Evans. 1977.
> The Selection of Class Intervals.
> Transactions of the Institute of British Geographers 2: 98-124.
>
> It will be accessible to many readers (but not all) through JSTOR.
>
> I must look again at how Jenks implemented his own least-squares criterion, but independently of this work in cartography, the problem has arisen in mainstream statistics. I suspect that Jenks' method would have had fewer users if he had used a more candid term such as "fortuitous breaks".
>
> The help for -group1d- gives a rather detailed discussion with documentation showing that the problem goes back to 1958 at least, but I won't repeat that here.
>
> -group1d- has a habit of picking out moderate outliers as singleton groups, but then that is hardly surprising given its least-squares criterion. I've been echoing Hartigan's 1975 comment intermittently over the last 30 years that least first powers (L_1 norm) is an alternative  without ever implementing it.
>
> Although Ada supplies her series in some jumbled order I am presuming she wants breaks in the distribution, i.e. to group the ordered values.
>
> I see 49 values in her example. Reading those in
>
> . sort var1
>
> . group1d var1, max(7)
>
>  Partitions of 49 data up to 7 groups
>
>  1 group:  sum of squares 8604.04
>  Group Size    First            Last           Mean      SD
>   1      49    1  92.2135      49  144.228   112.04   13.25
>
>  2 groups: sum of squares 3059.49
>  Group Size    First            Last           Mean      SD
>   2      33   17   108.97      49  144.228   119.44    9.03
>   1      16    1  92.2135      16   107.54    96.76    4.80
>
>  3 groups: sum of squares 1073.41
>  Group Size    First            Last           Mean      SD
>   3      10   40  124.565      49  144.228   130.97    6.70
>   2      24   16   107.54      39  121.712   114.14    3.97
>   1      15    1  92.2135      15  104.744    96.04    4.04
>
>  4 groups: sum of squares 524.92
>  Group Size    First            Last           Mean      SD
>   4       4   46  133.568      49  144.228   138.39    4.33
>   3      14   32  116.865      45  127.885   122.00    3.79
>   2      19   13  102.857      31  115.641   110.48    3.51
>   1      12    1  92.2135      12  95.5293    94.09    1.11
>
>  5 groups: sum of squares 309.08
>  Group Size    First            Last           Mean      SD
>   5       4   46  133.568      49  144.228   138.39    4.33
>   4       9   37   120.04      45  127.885   124.33    2.61
>   3      14   23  112.013      36  119.072   114.95    2.37
>   2      10   13  102.857      22  111.134   107.88    2.82
>   1      12    1  92.2135      12  95.5293    94.09    1.11
>
>  6 groups: sum of squares 185.01
>  Group Size    First            Last           Mean      SD
>   6       4   46  133.568      49  144.228   138.39    4.33
>   5       6   40  124.565      45  127.885   126.03    1.15
>   4       9   31  115.641      39  121.712   118.61    1.91
>   3      14   17   108.97      30   114.67   111.74    1.74
>   2       4   13  102.857      16   107.54   104.77    1.73
>   1      12    1  92.2135      12  95.5293    94.09    1.11
>
>  7 groups: sum of squares 116.89
>  Group Size    First            Last           Mean      SD
>   7       2   48  140.798      49  144.228   142.51    1.72
>   6       2   46  133.568      47  134.952   134.26    0.69
>   5       6   40  124.565      45  127.885   126.03    1.15
>   4       9   31  115.641      39  121.712   118.61    1.91
>   3      14   17   108.97      30   114.67   111.74    1.74
>   2       4   13  102.857      16   107.54   104.77    1.73
>   1      12    1  92.2135      12  95.5293    94.09    1.11
>
>  Groups     Sums of squares
>     1         8604.04
>     2         3059.49
>     3         1073.41
>     4          524.92
>     5          309.08
>     6          185.01
>     7          116.89
>
> It is vital to check graphically that the groups (breaks) make sense. -qplot- from SJ is especially useful here.
>
> . qplot var1, rank xli(12.5 22.5 36.5 45.5)
>
> The graph shows that in this case two of the groups are fairly distinct, but the other subdivisions seem less convincing.
>
> Nick
> n.j.cox@durham.ac.uk
>
>
> Thank you to both Partha Deb and Kyle Hood for providing me with some
> very promising looking leads to attempt.
>
> On Wed, Mar 11, 2009 at 7:11 AM, Kyle K. Hood <kyle.hood@yale.edu> wrote:
>
>> In mapping, univariate classification schemes are used to group features
>> together.  An example is Jenks' natural breaks, which simply defines k-1
>> cutoffs to minimize within-group sums of square deviations from group means.
>>  Unfortunately,
>>
>> . findit jenks
>>
>> produces nothing.  However, there is information on the web regarding how to
>> compute these cutoffs (just google it).  I'm not sure how closely this
>> method relates to cluster analysis and finite mixture models.
>
> Partha Deb wrote:
>
>>> Although one can never be sure what's in someone else's mind, I suspect
>>> you are looking for cluster analysis. -help cluster- .  Finite mixture
>>> models may also be of interest. -findit fmm- .  See
>>> http://users.ox.ac.uk/~polf0050/ISS%20Lecture%208.pdf for a set of slides by
>>> Stephen Fisher that has an introduction to Cluster analysis and finite
>>> mixture models.
>
>
>>>> Let's say I have 50 packets of crisps of various weights and I would
>>>> like to separate these 50 packets of crisps into five groups based on
>>>> their weights in grams, as follows:
>>>>
>>>> 108.9702
>>>> 111.1337
>>>> 112.5217
>>>> 112.6697
>>>> 112.9962
>>>> 114.0323
>>>> 114.6699
>>>> 116.8646
>>>> 119.0719
>>>> 124.5645
>>>> 124.691
>>>> 126.4943
>>>> 126.5528
>>>> 133.5675
>>>> 134.9519
>>>> 140.7979
>>>> 144.228
>>>> 102.8566
>>>> 103.9373
>>>> 104.7436
>>>> 107.5397
>>>> 109.4443
>>>> 109.7089
>>>> 110.395
>>>> 112.1248
>>>> 113.6032
>>>> 115.6405
>>>> 117.1919
>>>> 120.0395
>>>> 121.0714
>>>> 121.7119
>>>> 110.1116
>>>> 112.0128
>>>> 117.6563
>>>> 118.2418
>>>> 126.0027
>>>> 127.8855
>>>> 92.21352
>>>> 92.45715
>>>> 92.953
>>>> 93.01508
>>>> 94.05335
>>>> 94.27259
>>>> 94.38242
>>>> 94.72507
>>>> 94.83315
>>>> 95.25914
>>>> 95.37813
>>>> 95.52933
>>>>
>>>> I don't want to separate them into five equally sized groups.  I want
>>>> to separate the packets into groups so that the group members are most
>>>> similar to one another.  I am looking for a method (or methods?) to
>>>> achieve this end but I don't know where to start.  If you can think of
>>>> any suggestion please fire away and I'd be most grateful!
>>>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

--
Research Fellow
Health Economics Research Unit
University of Aberdeen, UK.
http://www.abdn.ac.uk/heru/
Tel: +44 (0) 1224 553863
Fax: +44 (0) 1224 550926

```
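The least-squares criterion discussed in the thread can be checked independently of Stata. Below is a minimal Python sketch, not the code of -group1d- or -cluster kmeans-. It assumes that the 49 weights listed in the question are the data used throughout, and it uses a textbook dynamic programme over contiguous groups of the sorted values. The names `optimal_groups`, `within_ss` and `nearest_centre` are made up for illustration. The second helper looks at the puzzle raised above: with a single variable, the centre that is nearest by absolute difference is also nearest by squared difference, so if -cluster kmeans- assigns each observation to its nearest group mean (an assumption about its internals, not something shown in the thread), the two measures alone should not produce different groupings from the same starting centres. Different starting centres would then be a more likely explanation, but that is a guess.

```python
from itertools import accumulate

# The 49 crisp-packet weights (grams) listed in the original question.
WEIGHTS = [
    108.9702, 111.1337, 112.5217, 112.6697, 112.9962, 114.0323, 114.6699,
    116.8646, 119.0719, 124.5645, 124.691, 126.4943, 126.5528, 133.5675,
    134.9519, 140.7979, 144.228, 102.8566, 103.9373, 104.7436, 107.5397,
    109.4443, 109.7089, 110.395, 112.1248, 113.6032, 115.6405, 117.1919,
    120.0395, 121.0714, 121.7119, 110.1116, 112.0128, 117.6563, 118.2418,
    126.0027, 127.8855, 92.21352, 92.45715, 92.953, 93.01508, 94.05335,
    94.27259, 94.38242, 94.72507, 94.83315, 95.25914, 95.37813, 95.52933,
]


def within_ss(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean for sorted values i..j-1."""
    n = j - i
    s = prefix[j] - prefix[i]
    return (prefix_sq[j] - prefix_sq[i]) - s * s / n


def optimal_groups(values, k):
    """Least-squares partition of the sorted values into k contiguous groups.

    Returns (total within-group sum of squares, list of groups).
    """
    x = sorted(values)
    n = len(x)
    prefix = [0.0] + list(accumulate(x))
    prefix_sq = [0.0] + list(accumulate(v * v for v in x))
    inf = float("inf")
    # cost[g][j]: smallest sum of squares for the first j values in g groups
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = cost[g - 1][i] + within_ss(prefix, prefix_sq, i, j)
                if c < cost[g][j]:
                    cost[g][j], cut[g][j] = c, i
    # Walk the stored cut points back from the end to recover the groups.
    groups, j = [], n
    for g in range(k, 0, -1):
        i = cut[g][j]
        groups.append(x[i:j])
        j = i
    return cost[k][n], groups[::-1]


def nearest_centre(value, centres, squared=False):
    """Index of the closest centre under |x - c| or (x - c)^2 (one variable)."""
    def dist(c):
        diff = abs(value - c)
        return diff * diff if squared else diff
    return min(range(len(centres)), key=lambda m: dist(centres[m]))


if __name__ == "__main__":
    ss, groups = optimal_groups(WEIGHTS, 4)
    print("sum of squares:", round(ss, 2))
    print("group sizes:", [len(g) for g in groups])
    centres = [sum(g) / len(g) for g in groups]
    same = all(
        nearest_centre(v, centres) == nearest_centre(v, centres, squared=True)
        for v in WEIGHTS
    )
    print("same assignment under L2 and L2squared:", same)
```

The printed sum of squares and group sizes for four groups can be compared with the -group1d- listing quoted in the message and with tot1euclidsq in the -summarize- output. The triple loop is cubic in the worst case, which is fine for 49 values but would need a smarter search for long series.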