Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Cluster analyis on hand made distance matrix


From   Ulrich Kohler <kohler@wzb.eu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Cluster analyis on hand made distance matrix
Date   Wed, 12 Mar 2008 09:33:04 +0100

Thank you very much, indeed.

Am Dienstag, den 11.03.2008, 11:46 -0500 schrieb khigbee@stata.com:
> Ulrich Kohler <kohler@wzb.eu> sent me the SQdist1 and SQdist2 matrices
> from the query below.
> 
> >> I have two "hand made" distance matrizes, SQdist1 and SQdist2. Both
> >> distance matrizes are essentially identical, with the exception that
> >> they are differently ordered.
> >>
> >> If I perform a cluster analysis using singlelinkage for the two distance
> >> matrizes, I get identical results:
> >> 
> >> <cut>
> >>
> >> (The same is true for median-linkage and centroid linkage.)
> >> 
> >> However, if I use wards-linkage I get different results for the two
> >> distance matrizes:
> >> 
> >> . clustermat wards SQdist1, name(cluster1) add
> >> . clustermat wards SQdist2, name(cluster2) add
> >> . sum *_hgt
> >>
> >>     Variable |       Obs        Mean    Std. Dev.       Min        Max
> >> -------------+--------------------------------------------------------
> >> cluster1_hgt |        53    .7051013     .861406   .1666667   4.414418
> >> cluster2_hgt |        53    .7051013    .8751653   .1666667   4.645984
> >> 
> >> Although the difference doesn't seem large, it have led to quite
> >> different groupings in a practical application. Unfortunately, I am not
> >> an expert with cluster analysis. So, please, can anybody explain me why
> >> this happens? If the order of distance matrix matter for
> >> cluster-analysis, what is the "correct" order of the distance matrix,
> >> then?
> >
> > The hierarchical cluster analysis methods start with N groups
> > (each observation is a group).  At each step in the process the 2
> > closest groups are merged and this is continued until all
> > observations are in one group.  This can be viewed as a
> > dendrogram (cluster tree).
> > 
> > My guess is that there are ties in determining the closest 2
> > groups at one or more steps in the process and the order that the
> > data is presented changes which of these ties gets selected for
> > merging together at that step.
> > 
> > If Uli would like me to explore this further, he can send me the
> > SQdist1 and SQdist2 matrices and I will report back what I find.
> > 
> > Ken Higbee    khigbee@stata.com
> > StataCorp     1-800-STATAPC
> 
> My guess was correct.  The difference is due to ties in the
> dissimilarities.
> 
> The matrices are 54 x 54 and have lots of ties in the
> dissimilarities.  There are 54*53/2 = 1,431 elements in the
> strictly lower triangle of the matrix.  Of those 1,431
> dissimilarites, there are only 9 distinct values
> 
>           value     count
>        ------------------
>        .1666667        48
>        .3333333       138
>              .5       268
>        .6666667       385
>        .8333333       295
>               1       205
>        1.166667        62
>        1.333333        23
>             1.5         7
> 
> At the first step of the hierarchical clustering there are 48
> ties for smallest dissimilarity.  One of these is picked for
> combining 2 of the groups into 1, and the resulting 53 x 53
> dissimilarity matrix is then created from the original 54 x 54
> matrix using the Lance and Williams' recurrence formula.  See
> pages 86-87 of "[MV] cluster" in the Version 10 [MV] manual.  The
> process is then repeated.
> 
> Given the number of ties in the original 54 x 54 dissimilarity
> matrix, I expect that ties for smallest dissimilarity happen
> often during the steps of the algorithm.
> 
> Changing the order of the original dissimilarity matrix in this
> example will usually result in different tied pairs being
> combined at different stages of the algorithm.
> 
> Why did Uli notice the difference with Ward's linkage and not
> some of the other linkages?  Look at the table on page 87 of the
> manual for the recurrence formula.  You will see that alpha_i,
> alpha_j, and beta involve the group sizes for group i and group j
> (the groups being combined) and group k.  Some of the other
> linkages have simpler forms.  I believe this is why the
> differences due to ties is more apparent in his Ward's linkage
> results.
> 
> Uli only showed the summary of the _hgt variables.  Even for
> those linkages where his _hgt variable had the same summary (same
> mean, min, max, ...), he will probably see that the _hgt and _ord
> variables are different (not the summary, but the actual values)
> from one run to the other if the order has changed and there are
> ties involved.  The _ord variable indicates the order the
> clusters are joined in the hierarchy.  This will change depending
> on which tie is picked.
> 
> Uli also wonders which ordering is "correct".  I don't think that
> question has an answer.  It is not a matter of one solution being
> correct and the others being incorrect.
> 
> The hierarchical clustering methods were not designed to provide
> an "optimal" clustering solution.  For that you would have to try
> all possibilities (which is not feasible for most size problems)
> and if you did try all possiblities, you would most often not end
> up with a hierarchical solution (e.g., the optimal 8 group
> solution may not nest the optimal 7 group solution).
> 
> Ken Higbee    khigbee@stata.com
> StataCorp     1-800-STATAPC
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index