# Re: Re: st: are there any statistics rules that I can apply to separate numbers into groups?

From: khigbee@stata.com
To: statalist@hsphsun2.harvard.edu
Subject: Re: Re: st: are there any statistics rules that I can apply to separate numbers into groups?
Date: Thu, 12 Mar 2009 08:50:51 -0500

```
Ada Ma <heu034@googlemail.com> asks:

> ...  I was also playing around with the -cluster kmeans-
> command and find that -group1d- generates the same groupings -cluster
> kmeans- with the option -measure(L2squared)- applied.

Nick Cox <n.j.cox@durham.ac.uk> responds:

>> The fact that you get the same results with -group1d- and a k-means
>> approach is good fortune, as k-means methods don't guarantee that an
>> optimum will be found.

> I then compare the results of -cluster kmeans- with or without the
> -measure(L2squared)- option specified.  The result groupings are
> different.  I don't really understand why this should be the case for
> univariate clustering, because when I typed:
>
> help measure_option  (note the underscore between the words measure
> and option, without the underscore a different help file will show up)
>
> It is explained that the default option calculates the grouping by minimising:
>         requests the Euclidean distance / Minkowski distance metric
> with argument 2
>
>                            sqrt(sum((x_ia - x_ja)^2))
>
> But when the option -measure(L2squared)- is specified
>        grouping is assigned by minimising the square of the Euclidean
> distance / Minkowski distance metric with argument 2
>
>                               sum((x_ia - x_ja)^2)
>
> Here is some output generated using the same 49 observations:
>
> ... <output omitted> ...
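
[Editor's aside, not part of the original exchange:  sqrt() is
strictly increasing, so for any observation the nearest center under
L2 is also the nearest center under L2squared.  The assignment step
of kmeans therefore cannot differ between the two measures; any
difference in the final groupings must come from something else,
such as the random starting partitions.  A quick sanity check in
Stata:

    * squaring preserves the ordering of nonnegative distances
    display sqrt(4) < sqrt(9)    // 1 (true)
    display 4 < 9                // 1 (true): same ordering
]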

Look at the -start()- option detailed in -help cluster_kmeans-,
and notice that many of the suboptions of -start()- take a random
number seed value as an argument, including the default
-start(krandom())-.

As Nick pointed out, it was just luck that your first run of
-cluster kmeans- produced the same clustering as -group1d-.  Set
the random number seed to different values before several runs
and you may get several different answers.  Kmeans clustering is
not guaranteed to find an optimal solution.
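
[Editor's aside: a minimal sketch of the point above.  The variable
name x, the number of groups, and the seed values are my own
illustration, not from the thread:

    * two runs of kmeans on the same data with different seeds
    cluster kmeans x, k(4) name(run1) start(krandom(12345))
    cluster kmeans x, k(4) name(run2) start(krandom(54321))

    * identical groupings show up as one nonzero cell per row and
    * per column; anything else means the two runs disagree
    tabulate run1 run2
]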

>> The main point of -group1d- is that it produces classes that are
>> contiguous intervals in one dimension. In contrast -cluster- has
>> no notion of contiguity.

Kmeans clustering can be applied to any number of dimensions.
The one-dimensional case is not given any special treatment.

Side note:  Ada indicates that -help measure_option- and
-help measure option- display different help files.  I cannot
reproduce that behavior; it displays the same help file for me.
Ada, can you reproduce it?  If so, email me and tell me about it
(including the output of -update query- in your Stata).

Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```