Statalist



RE: st: are there any statistics rules that I can apply to separate numbers into groups?


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: are there any statistics rules that I can apply to separate numbers into groups?
Date   Thu, 12 Mar 2009 12:41:34 -0000

The fact that you get the same results with -group1d- and a k-means
approach is good fortune, as k-means methods don't guarantee that an
optimum will be found. 
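
A minimal sketch of that point, on made-up data rather than the poster's 49
observations: run -cluster kmeans- from several random starts and compare the
within-group sums of squares. The variable names (var1, g1, m1, wss1 and so
on), the cluster names km1 to km5 and the seeds are arbitrary choices for
illustration only.

```stata
* Synthetic stand-in data (an assumption; not the data in the thread)
* runiform() needs a reasonably recent Stata; older releases use uniform()
clear
set obs 49
set seed 2009
generate var1 = 100*runiform()

* Same k, five different random starting centres
forvalues s = 1/5 {
    cluster kmeans var1, k(4) start(krandom(`s')) generate(g`s') name(km`s')
    bysort g`s': egen double m`s' = mean(var1)
    egen double wss`s' = total((var1 - m`s')^2)
    display "seed `s': within-group SS = " wss`s'[1]
}
```

If the displayed sums of squares differ across seeds, at least one run has
stopped short of the best partition for that k.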

The main point of -group1d- is that it produces classes that are
contiguous intervals in one dimension. In contrast -cluster- has no
notion of contiguity. 
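
One way to check whether a given grouping is made of contiguous intervals is
to sort on the variable and count how often the group label changes between
neighbours. This is only a sketch on five invented observations; var1 and grp
are hypothetical names, and tied values would need extra care.

```stata
* Toy data (invented for illustration)
clear
input var1 grp
1 1
2 1
3 2
10 1
11 2
end

* Sort by value, then flag each change of group label between neighbours
sort var1
generate byte change = grp != grp[_n-1] if _n > 1
quietly count if change == 1
local nchanges = r(N)

* Number of distinct groups
quietly tabulate grp
local k = r(r)

* Contiguous intervals give exactly k - 1 label changes
display "label changes: `nchanges', groups: `k'"
display cond(`nchanges' == `k' - 1, "contiguous", "not contiguous")
```

Here group 1 holds the values 1, 2 and 10, so it is split by group 2 and the
check reports that the grouping is not contiguous.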

Your main question is about -cluster- and is best left to Ken Higbee, I
suspect. 

Nick 
n.j.cox@durham.ac.uk 

Ada Ma

Thanks to Nick for introducing me to this wonderful command -group1d-.
 It's exactly what I was looking for.

I have some further questions, which I hope someone can help me to
understand.  I was also playing around with the -cluster kmeans-
command and found that -group1d- generates the same groupings as -cluster
kmeans- with the option -measure(L2squared)- applied.

I then compared the results of -cluster kmeans- with and without the
-measure(L2squared)- option specified.  The resulting groupings are
different.  I don't really understand why this should be the case for
univariate clustering, because when I typed:

help measure_option  (note the underscore between the words measure
and option; without the underscore a different help file will show up)

It is explained that the default option calculates the grouping by
minimising:
        requests the Euclidean distance / Minkowski distance metric
        with argument 2

                           sqrt(sum((x_ia - x_ja)^2))

But when the option -measure(L2squared)- is specified,
       grouping is assigned by minimising the square of the Euclidean
       distance / Minkowski distance metric with argument 2

                              sum((x_ia - x_ja)^2)
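
With a single variable the first expression reduces to |x_i - x_j| and the
second to (x_i - x_j)^2. Squaring preserves the ordering of non-negative
numbers, so whichever centre is nearest under one measure is also nearest
under the other. A small sketch with invented values and two hypothetical
centres (c1, c2):

```stata
* Four invented observations
clear
input x
1
4
6
9
end

* Two hypothetical cluster centres
scalar c1 = 2
scalar c2 = 8

* Distance to each centre under both measures
generate d1_L2   = sqrt((x - c1)^2)
generate d2_L2   = sqrt((x - c2)^2)
generate d1_L2sq = (x - c1)^2
generate d2_L2sq = (x - c2)^2

* Nearest centre under each measure
generate byte near_L2   = cond(d1_L2   <= d2_L2,   1, 2)
generate byte near_L2sq = cond(d1_L2sq <= d2_L2sq, 1, 2)

list x near_L2 near_L2sq
```

The two "nearest centre" columns agree for every observation. So a difference
between the two runs is unlikely to come from the measure itself. Different
starting centres are a plausible cause, although that is an assumption here
rather than something the output shows.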


Here is some output generated using the same 49 observations:

. cluster kmeans var1, k(4) generate(euclid)
cluster name: _clus_5

. cluster kmeans var1, k(4) generate(euclidsq) measure(L2squared)
cluster name: _clus_1


. tab  euclid euclidsq

           |                  euclidsq
    euclid |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         1 |        10          0          0          0 |        10
         2 |         0          0         12          0 |        12
         3 |         0          4          0          6 |        10
         4 |         9          0          0          8 |        17
-----------+--------------------------------------------+----------
     Total |        19          4         12         14 |        49


. bys euclid: egen m_euclid=mean(var1)

. bys euclidsq: egen m_euclidsq=mean(var1)

. egen tot1euclid=total((var1-m_euclid)^2)

. egen tot1euclidsq=total((var1-m_euclidsq)^2)

. sum tot*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
  tot1euclid |        49    712.2434           0   712.2434   712.2434
tot1euclidsq |        49    524.9169           0   524.9169   524.9169

. di sqrt(712.2434 )
26.687889

. di sqrt( 524.9169  )
22.911065


The groupings generated with the option -measure(L2squared)- applied are
superior to those generated without it: the within-group sum of squares
is 524.9 rather than 712.2.  This shouldn't be the case for univariate
clustering, or should it?  Have I missed something important?
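
A possible way to separate the effect of the measure from the effect of the
starting centres is to give both runs the same deterministic start. The
sketch below uses made-up data and the -start(firstk)- option, which takes the
first k observations as the initial centres. The variable and cluster names
are arbitrary.

```stata
* Synthetic stand-in data (an assumption; not the data in the thread)
clear
set obs 49
set seed 2009
generate var1 = 100*runiform()

* Identical start for both runs; only the measure differs
cluster kmeans var1, k(4) start(firstk) generate(g_l2) name(km_l2)
cluster kmeans var1, k(4) start(firstk) generate(g_l2sq) name(km_l2sq) measure(L2squared)

* Cross-tabulate the two groupings
tabulate g_l2 g_l2sq
```

If every observation falls on the diagonal of the table, the two measures
give the same groupings once the start is held fixed.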

