# Re: st: Cluster analysis -similarity and dissimilarity measures

 From khigbee@stata.com To statalist@hsphsun2.harvard.edu Subject Re: st: Cluster analysis -similarity and dissimilarity measures Date Tue, 24 Oct 2006 14:46:32 -0500

```
Bellinda Kallimanis <bkallimanis@fmhi.usf.edu> asked about doing
a cluster analysis on ordinal data.  I provided some preliminary
information in a previous response.  I would like to expand and
clarify in this posting.

In my previous response I showed one method of dealing with
categorical (nominal) (and sometimes ordinal) data in cluster
analysis by using a binary (dis)similarity measure on the
indicator or dummy variables produced from the categorical
variables.  I did not go into any details about the pros and
cons of this method.

Anderberg (1973, pp. 117-118) says:

"A number of different schemes have been proposed for coding
nominal and ordinal variables in terms of binary variables so
that the distinctions of multiple classes can be retained while
enjoying the advantages of binary measures.  These schemes are
rather plausible sounding and frequently are offered without
qualification."

Anderberg warns concerning some of these, including the simple
coding method (i.e. using the dummy variables) and the additive
coding method (that I will explain below) as follows:

"... their uses are quite limited and necessarily require
certain precautions to retain any degree of rationality in the
results."

Reading further, it is clear that with simple coding you would
pick measures such as Jaccard or Dice (where the 0-0 matches do
not matter) over measures such as matching (where the 0-0 matches
do matter).  In my quick example I arbitrarily used matching.
You will probably want to avoid that measure with dummy variables.
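
// A minimal sketch of simple coding with a measure that ignores
// 0-0 matches.  It is an illustration only: the auto data, rep78
// as the categorical variable, the d stub, and the name jacc are
// all assumptions, not Bellinda's setup.  Observations with a
// missing rep78 get missing dummies and are left out by -cluster-.

sysuse auto, clear
tabulate rep78, generate(d)        // dummies d1-d5, one per level
cluster singlelinkage d1-d5, measure(Jaccard) name(jacc)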

The next method shown by Anderberg (1973, pp. 119-120) is
additive coding, which preserves the order of the levels when
creating 0-1 variables from the ordinal variable.

For example with a variable v with 4 ordinal levels you would
create three binary variables (b1, b2, b3) using the following
mapping.

ordinal |  dummy binary variables
 class  |     b1   b2   b3
--------+-----------------------
   1    |     0    0    0
   2    |     1    0    0
   3    |     1    1    0
   4    |     1    1    1

With additive coding you probably do want to include the 0-0
matches so that the matching coefficient might be applicable
here, while you would probably avoid measures such as Jaccard and
Dice.

One way to produce b1, b2, and b3 from v in Stata is

gen byte b1 = v >= 2
gen byte b2 = v >= 3
gen byte b3 = v >= 4
replace b1 = . if missing(v)
replace b2 = . if missing(v)
replace b3 = . if missing(v)

If you had several variables and/or several ordinal levels you
could take care of this in a loop.
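
// A minimal sketch of such a loop.  The data are made up, and the
// names v1-v3, the 4 levels, and the <var>_b<j> naming scheme are
// assumptions for illustration.

clear
set obs 8
forvalues i = 1/3 {
    gen byte v`i' = 1 + mod(_n + `i', 4)     // fake levels 1-4
}
foreach v of varlist v1 v2 v3 {
    forvalues k = 2/4 {
        local j = `k' - 1
        // indicator is 1 when the level is at least k, and
        // missing when the ordinal variable is missing
        gen byte `v'_b`j' = `v' >= `k' if !missing(`v')
    }
}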

Additional guidance and correction factors are discussed by
Anderberg.  I will not go into them here.

In my original response I indicated that if you found a
dissimilarity measure for ordinal data in the literature that you
could use Mata to produce that dissimilarity matrix for your data
and then pass that along to -clustermat- to do your cluster
analysis.  I want to illustrate that here.

Spath (1980, p. 31) indicates one possible distance function to
use for ordinal data (Spath cites Soergel, 1967 for this).

         sum(x_k) + sum(y_k) - 2*sum(min(x_k,y_k))
d(x,y) = -----------------------------------------
          sum(x_k) + sum(y_k) - sum(min(x_k,y_k))

x and y are two observation row vectors, and k in the sum()s goes
from 1 to the number of ordinal variables.
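
// A worked example with made-up values (not from Spath): for
// x = (1, 3, 5, 2) and y = (2, 3, 4, 4), sum(x_k) = 11,
// sum(y_k) = 13, and sum(min(x_k,y_k)) = 1 + 3 + 4 + 2 = 10, so
// d(x,y) = (24 - 20)/(24 - 10) = 4/14.  The same arithmetic in
// Mata, with x, y, m, and d as illustrative names:

mata:
x = (1, 3, 5, 2)
y = (2, 3, 4, 4)
m = sum(colmin(x \ y))          // sum of the elementwise minima
d = (sum(x) + sum(y) - 2*m) / (sum(x) + sum(y) - m)
d
end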

Here is the Stata and Mata code implementing this distance
measure and performing a single-linkage cluster analysis on 4
ordinal variables each with 5 levels.

// Get some data
. sysuse auto, clear
. keep in 11/30
. xtile xlen = length, n(5)
. xtile xturn = turn, n(5)
. xtile xtrunk = trunk, n(5)

// The variables rep78, xlen, xturn, and xtrunk are
// ordinal with 5 levels

// Use mata to produce the distance matrix
. mata:
: void function myorddist(string varlist, string Dmat)
> {
>     real matrix Dist
>     real matrix Data
>
>     V = st_varindex(tokens(varlist))
>     Data = J(1,1,0)
>     st_view(Data,.,V)
>     Dist = J(rows(Data), rows(Data),0)
>     rsum = rowsum(Data)
>     for(i=1; i<=rows(Data); i++) {
>         for(j=1; j<=i; j++){
>             minsum = rowsum(colmin(Data[i,.]\Data[j,.]))
>             Dist[i,j] = (rsum[i,1] + rsum[j,1] - 2 * minsum) :/ ///
>                         (rsum[i,1] + rsum[j,1] - minsum)
>             Dist[j,i] = Dist[i,j]
>         }
>     }
>     st_matrix(Dmat, Dist)
> }
: end

. mata: myorddist("rep78 xlen xturn xtrunk", "myD")

// Show the upper corner of myD
. matlist myD[1..5,1..5]

// Do the cluster analysis
. clustermat singlelinkage myD, add
. cluster tree

I have not fully tested the Mata code, but I did do some spot
checking of several of the entries in myD and they agreed with my
hand calculation for the distance formula.  I did not add
functionality for dealing with -if- or -in- conditions for
selecting which observations to use etc.  Those extras would not

Hopefully this worked out example will help Bellinda (and others)
who want to perform cluster analysis on ordinal data using a
dissimilarity of their choosing that is not already a part of
official Stata.

References:

Anderberg, Michael R. 1973. "Cluster Analysis For
Applications".  Academic Press.

Soergel, D. 1967. "Mathematical analysis of documentation
systems".  Inform.  Stor. Retr. 3:129-173.

Spath, Helmuth. 1980. "Cluster Analysis Algorithms".  Ellis
Horwood Ltd.

Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC

```