Statalist



Re: Re: st: are there any statistics rules that I can apply to separate numbers into groups?


From   khigbee@stata.com
To   statalist@hsphsun2.harvard.edu
Subject   Re: Re: st: are there any statistics rules that I can apply to separate numbers into groups?
Date   Thu, 12 Mar 2009 08:50:51 -0500

Ada Ma <heu034@googlemail.com> asks:

> ...  I was also playing around with the -cluster kmeans-
> command and find that -group1d- generates the same groupings -cluster
> kmeans- with the option -measure(L2squared)- applied.

Nick Cox <n.j.cox@durham.ac.uk> responds:

>> The fact that you get the same results with -group1d- and a k-means
>> approach is good fortune, as k-means methods don't guarantee that an
>> optimum will be found.

Ada's question continues:

> I then compare the results of -cluster kmeans- with or without the
> -measure(L2squared)- option specified.  The result groupings are
> different.  I don't really understand why this should be the case for
> univariate clustering, because when I typed:
>
> help measure_option  (note the underscore between the words measure
> and option, without the underscore a different help file will show up)
> 
> It is explained that the default option calculates the grouping by minimising:
>         requests the Euclidean distance / Minkowski distance metric
> with argument 2
> 
>                            sqrt(sum((x_ia - x_ja)^2))
> 
> But when the option -measure(L2squared)- is specified
>        grouping is assigned by minimising the square of the Euclidean
> distance / Minkowski distance metric with argument 2
> 
>                               sum((x_ia - x_ja)^2)
> 
> Here are some output generated using the same 49 observations:
>
> ... <output omitted> ...

Look at the -start()- option detailed in -help cluster_kmeans-.
Notice that many of the suboptions of -start()- take a random
number seed value as an argument, including the default,
-start(krandom())-.

As Nick pointed out, it was just luck that your first run of
-cluster kmeans- produced the same clustering as -group1d-.  Set
the random number seed to different values before several runs
and you might get several different answers.  Kmeans clustering
is not guaranteed to find an optimal solution.
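
To illustrate why the starting values matter, here is a minimal
sketch in Python (not Stata, and not the code behind -cluster
kmeans-) of a plain k-means iteration on one variable.  The
function names kmeans_1d and random_start and the nine data
values are made up for this illustration; random_start only
loosely imitates the idea of picking k observations at random
as starting centers.

```python
import random


def kmeans_1d(xs, centers, max_iter=100):
    """Plain k-means on one variable, starting from the given centers.

    Returns (labels, sse), where sse is the within-group sum of squares.
    Illustrative only; not Stata's implementation.
    """
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assign each value to its nearest center (ties go to the lowest index).
        new_labels = [
            min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            for x in xs
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members; keep it if it has none.
        for j in range(len(centers)):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    sse = sum((x - centers[lab]) ** 2 for x, lab in zip(xs, labels))
    return labels, sse


def random_start(xs, k, seed):
    """Hypothetical stand-in for a seeded random start: k observations."""
    return random.Random(seed).sample(list(xs), k)


if __name__ == "__main__":
    data = [0, 1, 2, 10, 11, 12, 20, 21, 22]  # made-up values

    # Two hand-picked starts that end in different groupings.
    print(kmeans_1d(data, [0, 10, 20]))  # sse 6.0: three tight groups
    print(kmeans_1d(data, [0, 1, 2]))    # sse 154.5: stuck in a poor grouping

    # Seeded random starts; the result depends on which values are drawn.
    for seed in range(5):
        print(seed, kmeans_1d(data, random_start(data, 3, seed)))
```

Both hand-picked starts converge, in the sense that no value
changes group any more, yet the second grouping has a far larger
within-group sum of squares.  That is the sense in which a single
run can stop at a solution that is not optimal.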

Quoting Nick Cox's answer:

>> The main point of -group1d- is that it produces classes that are
>> contiguous intervals in one dimension. In contrast -cluster- has
>> no notion of contiguity.

Kmeans clustering can be applied to any number of dimensions.
The case of having only 1 dimension is not given any special
treatment.
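
In one dimension, by contrast, it is feasible to restrict
attention to contiguous groupings and compare them directly.
The following Python sketch does this by brute force on a small
made-up data set.  It is only an illustration of the idea of
contiguous classes; it is not the algorithm used by -group1d-,
and the function names are hypothetical.

```python
from itertools import combinations


def group_sse(group):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)


def best_contiguous_partition(xs, k):
    """Try every split of the sorted values into k contiguous groups.

    Returns (total_sse, groups) for the split with the smallest total
    within-group sum of squares.  Exhaustive search, so only suitable
    for small n; shown purely as an illustration.
    """
    xs = sorted(xs)
    n = len(xs)
    best = None
    # Choose k - 1 cut positions between consecutive sorted values.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        groups = [xs[a:b] for a, b in zip(bounds, bounds[1:])]
        total = sum(group_sse(g) for g in groups)
        if best is None or total < best[0]:
            best = (total, groups)
    return best


if __name__ == "__main__":
    data = [0, 1, 2, 10, 11, 12, 20, 21, 22]  # made-up values
    print(best_contiguous_partition(data, 3))
```

Because every contiguous split is examined, the result does not
depend on any starting values or random number seed, which is
the key difference from a k-means run.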

Side note:  Ada indicates that -help measure_option- and
-help measure option- display different help files.  I cannot
reproduce that behavior; both display the same help file for me.
Ada, can you reproduce that behavior?  If so, email me and tell
me more about your setup (send me the output of typing -about-
and -update query- in your Stata).


Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC
