
From: khigbee@stata.com
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Cluster analysis
Date: Tue, 18 May 2004 11:59:30 -0500

Justin <jdubas@nd.edu> asks:

> I am trying to do a kmeans cluster analysis, and I have a couple
> issues that keep coming up.  First, I use the default random
> start option and get certain results.  Then I used the segments
> start option, and the results are quite different.  Is there any
> explanation for this?

This is often an indication that your data cannot be separated into
k (whatever k you asked for) groups very well.  Try the following
experiment to get a feel for why this might be true.  Generate random
data (no real groups in the data).  Run -cluster kmeans- with various
starting values.  Do some cross tabs of the resulting grouping
variables to get a feel for how much they disagree.  Also run
-cluster stop- for each run and see how the pseudo-F value changes.
For instance, I did

    clear
    set seed 412389
    set obs 500
    gen x1 = uniform()
    gen x2 = uniform()
    cluster kmeans x1 x2, k(5) name(try1) start(krandom(28392))
    cluster kmeans x1 x2, k(5) name(try2) start(krandom(11833))
    cluster kmeans x1 x2, k(5) name(try3) start(krandom(3216))
    cluster stop try1
    cluster stop try2
    cluster stop try3
    tab try1 try2
    tab try1 try3
    tab try2 try3

Now try a similar experiment on data that has a reasonable chance of
having some natural groupings within the data.

    sysuse auto, clear
    cluster kmeans price length , k(5) name(a1) start(kr(4484))
    cluster kmeans price length , k(5) name(a2) start(kr(33232))
    cluster kmeans price length , k(5) name(a3) start(kr(678213))
    ...
    cluster stop a1
    ...
    tab a1 a2
    ...

You will find more agreement between these results than with the
2-dimensional random uniform data.  You will still find some of the
runs going to different solutions, because this particular case does
not naturally break into 5 groups, but it does so better than the
totally random data.

The usual strategy is to make several (maybe many) runs with
different starting values and take the solution that gives the
largest value produced by -cluster stop-.
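The multi-start strategy just described can be sketched as a loop.
This is a minimal sketch and not part of the original message: the
seeds are the ones used in the auto example, the names run1-run3 and
the locals best_F and best_run are hypothetical, and it assumes that
-cluster stop- (default Calinski-Harabasz rule) leaves the pseudo-F
for a 5-group solution in r(calinski_5).  Type -return list- after
-cluster stop- to confirm the stored result name in your version.

```stata
* Sketch: rerun -cluster kmeans- from several random starts and
* remember which run has the largest Calinski-Harabasz pseudo-F.
sysuse auto, clear

local best_F = .
local best_run ""
local i = 0
foreach seed in 4484 33232 678213 {
    local ++i
    cluster kmeans price length, k(5) name(run`i') start(krandom(`seed'))
    quietly cluster stop run`i'
    * Assumption: the pseudo-F for 5 groups is stored in r(calinski_5)
    local F = r(calinski_5)
    display "run`i' (seed `seed'): pseudo-F = " %9.2f `F'
    if missing(`best_F') | `F' > `best_F' {
        local best_F = `F'
        local best_run "run`i'"
    }
}
display "largest pseudo-F: `best_run' (" %9.2f `best_F' ")"
```

Cross-tabulating the grouping variables (for example, -tab run1 run2-)
alongside the pseudo-F values shows whether the runs reach the same
partition or merely similar scores.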
If you saw many different solutions while doing this, then it is an
indication that you are trying to force the data into groups that are
not distinct.

> Also, I tried to use group(varname) as a start option, but when I
> run this, I keep getting an error message that the variable I
> chose "does not define k (in my case, 5) groups".  How can I fix
> this?

Does your variable "varname" have 5 and only 5 levels?  When I try
something like

    sysuse auto
    cluster kmeans price length, k(5) start(groups(rep78))

it works fine for me.  rep78 takes on 5 possible values, and I
correspondingly asked for "k(5)" with -cluster kmeans-.

Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

