Re: st: Cluster analysis

Subject   Re: st: Cluster analysis
Date   Tue, 18 May 2004 11:59:30 -0500

Justin asks:

> I am trying to do a kmeans cluster analysis, and I have a couple
> issues that keep coming up.  First, I use the default random
> start option and get certain results. Then I used the segments
> start option, and the results are quite different.  Is there any
> explanation for this?

This is often an indication that your data cannot be separated
very well into k groups (for whatever k you asked for).  Try the
following experiment to get a feel for why this might be true.

Generate random data (no real groups in the data).  Run -cluster
kmeans- with various starting values.  Cross-tabulate the
resulting grouping variables to get a feel for how much they
disagree.  Also run -cluster stop- for each run and see how the
Calinski-Harabasz pseudo-F value (the default -cluster stop-
rule) changes.

For instance I did

    set seed 412389
    set obs 500
    gen x1 = uniform()
    gen x2 = uniform()

    cluster kmeans x1 x2, k(5) name(try1) start(krandom(28392))
    cluster kmeans x1 x2, k(5) name(try2) start(krandom(11833))
    cluster kmeans x1 x2, k(5) name(try3) start(krandom(3216))

    cluster stop try1
    cluster stop try2
    cluster stop try3

    tab try1 try2
    tab try1 try3
    tab try2 try3

Now try a similar experiment on data that has a reasonable chance
of having some natural groupings within the data.

    sysuse auto, clear
    cluster kmeans price length , k(5) name(a1) start(kr(4484))
    cluster kmeans price length , k(5) name(a2) start(kr(33232))
    cluster kmeans price length , k(5) name(a3) start(kr(678213))

    cluster stop a1
    cluster stop a2
    cluster stop a3

    tab a1 a2
    tab a1 a3
    tab a2 a3

You will find more agreement among these results than with the
2-dimensional random uniform data.  Some runs will still go to
different solutions, because this particular case does not
naturally break into 5 groups, but it does so better than the
totally random data.
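
One way to put a number on how much two runs agree is Cramer's V
from -tabulate-.  It ignores how the groups happen to be labeled
and equals 1 when two groupings match exactly up to relabeling.
This is only a sketch of one possible measure, not part of the
experiment above; it assumes the a1 and a2 grouping variables
created by the -cluster kmeans- runs already exist.

    * assumes a1 and a2 exist from the -cluster kmeans- runs above
    tabulate a1 a2, V
    display "Cramer's V = " r(CramersV)

Values near 1 suggest the two starts landed on essentially the
same solution; clearly lower values suggest they did not.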

The usual strategy is to make several (maybe many) runs with
different starting values and keep the solution with the largest
pseudo-F value reported by -cluster stop-.  If you see many
different solutions while doing this, it is an indication that
you are trying to force the data into groups that are not
naturally there.
> Also, I tried to use group(varname) as a start option, but when I
> run this, I keep getting an error message that the variable I
> chose "does not define k (in my case, 5) groups".  How can I fix
> this?

Does your variable "varname" have 5 and only 5 levels?  When I try
something like

    sysuse auto, clear
    cluster kmeans price length, k(5) start(groups(rep78))

it works fine for me.  rep78 takes on 5 possible values, and I
correspondingly asked for "k(5)" with -cluster kmeans-.
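
A quick way to check the number of levels before clustering is
to count the distinct nonmissing values of the variable.  The
sketch below also shows one possible workaround when a variable
does not have exactly k levels: build a k-level variable with
-xtile-.  The names grp5 and fromgrp are hypothetical, and
price quintiles are only an illustrative starting grouping.

    sysuse auto, clear

    * r(r) holds the number of distinct nonmissing values
    quietly tabulate rep78
    display "rep78 has " r(r) " levels"

    * hypothetical 5-level starting variable: price quintiles
    xtile grp5 = price, nq(5)
    cluster kmeans price length, k(5) name(fromgrp) start(group(grp5))

If the count does not equal the k you ask for, that mismatch is
the likely source of the "does not define k groups" message.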

Ken Higbee
StataCorp     1-800-STATAPC
