Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Cluster analyis on hand made distance matrix


From   khigbee@stata.com
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Cluster analyis on hand made distance matrix
Date   Tue, 11 Mar 2008 11:46:43 -0500

Ulrich Kohler <kohler@wzb.eu> sent me the SQdist1 and SQdist2 matrices
from the query below.

>> I have two "hand made" distance matrizes, SQdist1 and SQdist2. Both
>> distance matrizes are essentially identical, with the exception that
>> they are differently ordered.
>>
>> If I perform a cluster analysis using singlelinkage for the two distance
>> matrizes, I get identical results:
>> 
>> <cut>
>>
>> (The same is true for median-linkage and centroid linkage.)
>> 
>> However, if I use wards-linkage I get different results for the two
>> distance matrizes:
>> 
>> . clustermat wards SQdist1, name(cluster1) add
>> . clustermat wards SQdist2, name(cluster2) add
>> . sum *_hgt
>>
>>     Variable |       Obs        Mean    Std. Dev.       Min        Max
>> -------------+--------------------------------------------------------
>> cluster1_hgt |        53    .7051013     .861406   .1666667   4.414418
>> cluster2_hgt |        53    .7051013    .8751653   .1666667   4.645984
>> 
>> Although the difference doesn't seem large, it have led to quite
>> different groupings in a practical application. Unfortunately, I am not
>> an expert with cluster analysis. So, please, can anybody explain me why
>> this happens? If the order of distance matrix matter for
>> cluster-analysis, what is the "correct" order of the distance matrix,
>> then?
>
> The hierarchical cluster analysis methods start with N groups
> (each observation is a group).  At each step in the process the 2
> closest groups are merged and this is continued until all
> observations are in one group.  This can be viewed as a
> dendrogram (cluster tree).
> 
> My guess is that there are ties in determining the closest 2
> groups at one or more steps in the process and the order that the
> data is presented changes which of these ties gets selected for
> merging together at that step.
> 
> If Uli would like me to explore this further, he can send me the
> SQdist1 and SQdist2 matrices and I will report back what I find.
> 
> Ken Higbee    khigbee@stata.com
> StataCorp     1-800-STATAPC

My guess was correct.  The difference is due to ties in the
dissimilarities.

The matrices are 54 x 54 and have lots of ties in the
dissimilarities.  There are 54*53/2 = 1,431 elements in the
strictly lower triangle of the matrix.  Of those 1,431
dissimilarites, there are only 9 distinct values

          value     count
       ------------------
       .1666667        48
       .3333333       138
             .5       268
       .6666667       385
       .8333333       295
              1       205
       1.166667        62
       1.333333        23
            1.5         7

At the first step of the hierarchical clustering there are 48
ties for smallest dissimilarity.  One of these is picked for
combining 2 of the groups into 1, and the resulting 53 x 53
dissimilarity matrix is then created from the original 54 x 54
matrix using the Lance and Williams' recurrence formula.  See
pages 86-87 of "[MV] cluster" in the Version 10 [MV] manual.  The
process is then repeated.

Given the number of ties in the original 54 x 54 dissimilarity
matrix, I expect that ties for smallest dissimilarity happen
often during the steps of the algorithm.

Changing the order of the original dissimilarity matrix in this
example will usually result in different tied pairs being
combined at different stages of the algorithm.

Why did Uli notice the difference with Ward's linkage and not
some of the other linkages?  Look at the table on page 87 of the
manual for the recurrence formula.  You will see that alpha_i,
alpha_j, and beta involve the group sizes for group i and group j
(the groups being combined) and group k.  Some of the other
linkages have simpler forms.  I believe this is why the
differences due to ties is more apparent in his Ward's linkage
results.

Uli only showed the summary of the _hgt variables.  Even for
those linkages where his _hgt variable had the same summary (same
mean, min, max, ...), he will probably see that the _hgt and _ord
variables are different (not the summary, but the actual values)
from one run to the other if the order has changed and there are
ties involved.  The _ord variable indicates the order the
clusters are joined in the hierarchy.  This will change depending
on which tie is picked.

Uli also wonders which ordering is "correct".  I don't think that
question has an answer.  It is not a matter of one solution being
correct and the others being incorrect.

The hierarchical clustering methods were not designed to provide
an "optimal" clustering solution.  For that you would have to try
all possibilities (which is not feasible for most size problems)
and if you did try all possiblities, you would most often not end
up with a hierarchical solution (e.g., the optimal 8 group
solution may not nest the optimal 7 group solution).

Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index