[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
khigbee@stata.com |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Cluster analyis on hand made distance matrix |

Date |
Tue, 11 Mar 2008 11:46:43 -0500 |

Ulrich Kohler <kohler@wzb.eu> sent me the SQdist1 and SQdist2 matrices from the query below. >> I have two "hand made" distance matrizes, SQdist1 and SQdist2. Both >> distance matrizes are essentially identical, with the exception that >> they are differently ordered. >> >> If I perform a cluster analysis using singlelinkage for the two distance >> matrizes, I get identical results: >> >> <cut> >> >> (The same is true for median-linkage and centroid linkage.) >> >> However, if I use wards-linkage I get different results for the two >> distance matrizes: >> >> . clustermat wards SQdist1, name(cluster1) add >> . clustermat wards SQdist2, name(cluster2) add >> . sum *_hgt >> >> Variable | Obs Mean Std. Dev. Min Max >> -------------+-------------------------------------------------------- >> cluster1_hgt | 53 .7051013 .861406 .1666667 4.414418 >> cluster2_hgt | 53 .7051013 .8751653 .1666667 4.645984 >> >> Although the difference doesn't seem large, it have led to quite >> different groupings in a practical application. Unfortunately, I am not >> an expert with cluster analysis. So, please, can anybody explain me why >> this happens? If the order of distance matrix matter for >> cluster-analysis, what is the "correct" order of the distance matrix, >> then? > > The hierarchical cluster analysis methods start with N groups > (each observation is a group). At each step in the process the 2 > closest groups are merged and this is continued until all > observations are in one group. This can be viewed as a > dendrogram (cluster tree). > > My guess is that there are ties in determining the closest 2 > groups at one or more steps in the process and the order that the > data is presented changes which of these ties gets selected for > merging together at that step. > > If Uli would like me to explore this further, he can send me the > SQdist1 and SQdist2 matrices and I will report back what I find. > > Ken Higbee khigbee@stata.com > StataCorp 1-800-STATAPC My guess was correct. The difference is due to ties in the dissimilarities. The matrices are 54 x 54 and have lots of ties in the dissimilarities. There are 54*53/2 = 1,431 elements in the strictly lower triangle of the matrix. Of those 1,431 dissimilarites, there are only 9 distinct values value count ------------------ .1666667 48 .3333333 138 .5 268 .6666667 385 .8333333 295 1 205 1.166667 62 1.333333 23 1.5 7 At the first step of the hierarchical clustering there are 48 ties for smallest dissimilarity. One of these is picked for combining 2 of the groups into 1, and the resulting 53 x 53 dissimilarity matrix is then created from the original 54 x 54 matrix using the Lance and Williams' recurrence formula. See pages 86-87 of "[MV] cluster" in the Version 10 [MV] manual. The process is then repeated. Given the number of ties in the original 54 x 54 dissimilarity matrix, I expect that ties for smallest dissimilarity happen often during the steps of the algorithm. Changing the order of the original dissimilarity matrix in this example will usually result in different tied pairs being combined at different stages of the algorithm. Why did Uli notice the difference with Ward's linkage and not some of the other linkages? Look at the table on page 87 of the manual for the recurrence formula. You will see that alpha_i, alpha_j, and beta involve the group sizes for group i and group j (the groups being combined) and group k. Some of the other linkages have simpler forms. I believe this is why the differences due to ties is more apparent in his Ward's linkage results. Uli only showed the summary of the _hgt variables. Even for those linkages where his _hgt variable had the same summary (same mean, min, max, ...), he will probably see that the _hgt and _ord variables are different (not the summary, but the actual values) from one run to the other if the order has changed and there are ties involved. The _ord variable indicates the order the clusters are joined in the hierarchy. This will change depending on which tie is picked. Uli also wonders which ordering is "correct". I don't think that question has an answer. It is not a matter of one solution being correct and the others being incorrect. The hierarchical clustering methods were not designed to provide an "optimal" clustering solution. For that you would have to try all possibilities (which is not feasible for most size problems) and if you did try all possiblities, you would most often not end up with a hierarchical solution (e.g., the optimal 8 group solution may not nest the optimal 7 group solution). Ken Higbee khigbee@stata.com StataCorp 1-800-STATAPC * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Cluster analyis on hand made distance matrix***From:*Ulrich Kohler <kohler@wzb.eu>

- Prev by Date:
**st: RE: RE: Confidence Interval for Proportion** - Next by Date:
**st: RE: RE: RE: Confidence Interval for Proportion** - Previous by thread:
**Re: st: Cluster analyis on hand made distance matrix** - Next by thread:
**Re: st: Cluster analyis on hand made distance matrix** - Index(es):

© Copyright 1996–2017 StataCorp LLC | Terms of use | Privacy | Contact us | What's new | Site index |