[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
Ulrich Kohler <kohler@wzb.eu> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Cluster analyis on hand made distance matrix |

Date |
Wed, 12 Mar 2008 09:33:04 +0100 |

Thank you very much, indeed. Am Dienstag, den 11.03.2008, 11:46 -0500 schrieb khigbee@stata.com: > Ulrich Kohler <kohler@wzb.eu> sent me the SQdist1 and SQdist2 matrices > from the query below. > > >> I have two "hand made" distance matrizes, SQdist1 and SQdist2. Both > >> distance matrizes are essentially identical, with the exception that > >> they are differently ordered. > >> > >> If I perform a cluster analysis using singlelinkage for the two distance > >> matrizes, I get identical results: > >> > >> <cut> > >> > >> (The same is true for median-linkage and centroid linkage.) > >> > >> However, if I use wards-linkage I get different results for the two > >> distance matrizes: > >> > >> . clustermat wards SQdist1, name(cluster1) add > >> . clustermat wards SQdist2, name(cluster2) add > >> . sum *_hgt > >> > >> Variable | Obs Mean Std. Dev. Min Max > >> -------------+-------------------------------------------------------- > >> cluster1_hgt | 53 .7051013 .861406 .1666667 4.414418 > >> cluster2_hgt | 53 .7051013 .8751653 .1666667 4.645984 > >> > >> Although the difference doesn't seem large, it have led to quite > >> different groupings in a practical application. Unfortunately, I am not > >> an expert with cluster analysis. So, please, can anybody explain me why > >> this happens? If the order of distance matrix matter for > >> cluster-analysis, what is the "correct" order of the distance matrix, > >> then? > > > > The hierarchical cluster analysis methods start with N groups > > (each observation is a group). At each step in the process the 2 > > closest groups are merged and this is continued until all > > observations are in one group. This can be viewed as a > > dendrogram (cluster tree). > > > > My guess is that there are ties in determining the closest 2 > > groups at one or more steps in the process and the order that the > > data is presented changes which of these ties gets selected for > > merging together at that step. > > > > If Uli would like me to explore this further, he can send me the > > SQdist1 and SQdist2 matrices and I will report back what I find. > > > > Ken Higbee khigbee@stata.com > > StataCorp 1-800-STATAPC > > My guess was correct. The difference is due to ties in the > dissimilarities. > > The matrices are 54 x 54 and have lots of ties in the > dissimilarities. There are 54*53/2 = 1,431 elements in the > strictly lower triangle of the matrix. Of those 1,431 > dissimilarites, there are only 9 distinct values > > value count > ------------------ > .1666667 48 > .3333333 138 > .5 268 > .6666667 385 > .8333333 295 > 1 205 > 1.166667 62 > 1.333333 23 > 1.5 7 > > At the first step of the hierarchical clustering there are 48 > ties for smallest dissimilarity. One of these is picked for > combining 2 of the groups into 1, and the resulting 53 x 53 > dissimilarity matrix is then created from the original 54 x 54 > matrix using the Lance and Williams' recurrence formula. See > pages 86-87 of "[MV] cluster" in the Version 10 [MV] manual. The > process is then repeated. > > Given the number of ties in the original 54 x 54 dissimilarity > matrix, I expect that ties for smallest dissimilarity happen > often during the steps of the algorithm. > > Changing the order of the original dissimilarity matrix in this > example will usually result in different tied pairs being > combined at different stages of the algorithm. > > Why did Uli notice the difference with Ward's linkage and not > some of the other linkages? Look at the table on page 87 of the > manual for the recurrence formula. You will see that alpha_i, > alpha_j, and beta involve the group sizes for group i and group j > (the groups being combined) and group k. Some of the other > linkages have simpler forms. I believe this is why the > differences due to ties is more apparent in his Ward's linkage > results. > > Uli only showed the summary of the _hgt variables. Even for > those linkages where his _hgt variable had the same summary (same > mean, min, max, ...), he will probably see that the _hgt and _ord > variables are different (not the summary, but the actual values) > from one run to the other if the order has changed and there are > ties involved. The _ord variable indicates the order the > clusters are joined in the hierarchy. This will change depending > on which tie is picked. > > Uli also wonders which ordering is "correct". I don't think that > question has an answer. It is not a matter of one solution being > correct and the others being incorrect. > > The hierarchical clustering methods were not designed to provide > an "optimal" clustering solution. For that you would have to try > all possibilities (which is not feasible for most size problems) > and if you did try all possiblities, you would most often not end > up with a hierarchical solution (e.g., the optimal 8 group > solution may not nest the optimal 7 group solution). > > Ken Higbee khigbee@stata.com > StataCorp 1-800-STATAPC > > * > * For searches and help try: > * http://www.stata.com/support/faqs/res/findit.html > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**Re: st: Cluster analyis on hand made distance matrix***From:*khigbee@stata.com

- Prev by Date:
**Re: st: How to - scatter plot with regression line after adjustment for covariates** - Next by Date:
**st: ssc hot** - Previous by thread:
**Re: st: Cluster analyis on hand made distance matrix** - Next by thread:
**st: ivprobit estimation with weak instruments** - Index(es):

© Copyright 1996–2016 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |