Statalist The Stata Listserver



RE: st: Cluster analysis -similarity and dissimilarity measures


From   "Kallimanis, Bellinda" <bkallimanis@fmhi.usf.edu>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Cluster analysis -similarity and dissimilarity measures
Date   Tue, 24 Oct 2006 17:06:14 -0400

Hi Ken, 

Thank you for this. I had actually been trying the converted-rank
solution, which wasn't working too brilliantly for me, and I was just in
the process of looking for a measure for ordinal variables when your
email came through.

The code works perfectly. Thank you for taking the time to write this
code as it is beyond my programming skills, though I think I have learnt
a few things from this!! And thank you Nick Cox for your suggestions.

Kind Regards, 
Bellinda



-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of
khigbee@stata.com
Sent: Tuesday, October 24, 2006 3:47 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Cluster analysis -similarity and dissimilarity measures

Bellinda Kallimanis <bkallimanis@fmhi.usf.edu> asked about doing
a cluster analysis on ordinal data.  I provided some preliminary
information in a previous response.  I would like to expand and
clarify in this posting.

In my previous response I showed one method of dealing with
categorical (nominal) (and sometimes ordinal) data in cluster
analysis by using a binary (dis)similarity measure on the
indicator or dummy variables produced from the categorical
variables.  I did not go into any details about the pros and
cons of this method.

Anderberg (1973, pp. 117-118) says:

  "A number of different schemes have been proposed for coding
  nominal and ordinal variables in terms of binary variables so
  that the distinctions of multiple classes can be retained while
  enjoying the advantages of binary measures.  These schemes are
  rather plausible sounding and frequently are offered without
  qualification."

Anderberg warns concerning some of these, including the simple
coding method (i.e. using the dummy variables) and the additive
coding method (that I will explain below) as follows:

  "... their uses are quite limited and necessarily require
  certain precautions to retain any degree of rationality in the
  results."

Reading further it is clear that you would pick measures such as
Jaccard or Dice (where the 0-0 matches do not matter) over
measures such as matching (where the 0-0 matches do matter).  In
my quick example I arbitrarily used matching.  You will probably
want to avoid that measure.

The next method shown by Anderberg (1973, pp. 119-120) for
ordinal data is to use additive coding instead of simple coding
when creating 0-1 variables from the ordinal variable.

For example with a variable v with 4 ordinal levels you would
create three binary variables (b1, b2, b3) using the following
mapping.

     ordinal | dummy binary variables
     class   |     b1   b2   b3
     --------+-----------------------
       1     |     0    0    0
       2     |     1    0    0
       3     |     1    1    0
       4     |     1    1    1

With additive coding you probably do want to include the 0-0
matches so that the matching coefficient might be applicable
here, while you would probably avoid measures such as Jaccard and
Dice.

One way to produce b1, b2, and b3 from v in Stata is

    gen byte b1 = v >= 2
    gen byte b2 = v >= 3
    gen byte b3 = v >= 4
    replace b1 = . if missing(v)
    replace b2 = . if missing(v)
    replace b3 = . if missing(v)
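
One way to see why the 0-0 matches carry information under additive
coding: two observations at levels a and b differ in exactly |a - b| of
the binary variables, so the matching coefficient falls off linearly
with the gap between levels.  The following is only a minimal sketch on
a made-up dataset with one observation per level; it assumes that
-matrix dissimilarity- returns similarities (not dissimilarities) by
default for the matching measure.

```stata
* Made-up data: one observation at each of the 4 ordinal levels
clear
input byte v
1
2
3
4
end

* Additive coding, as in the mapping table above
gen byte b1 = v >= 2
gen byte b2 = v >= 3
gen byte b3 = v >= 4

* Matching coefficient between every pair of observations
matrix dissimilarity S = b1 b2 b3, matching
matrix list S
```

Under that assumption, entry [i,j] of S should be (3 - |i - j|)/3, so
adjacent levels look more alike than distant ones.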

If you had several variables and/or several ordinal levels you
could take care of this in a loop.
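
As a rough sketch of such a loop (not taken from Anderberg), the code
below builds additive-coded variables for three ordinal variables with
4 levels each.  The variable names v1-v3, the toy data, and the naming
pattern `v'_b`j' are all hypothetical and chosen only for illustration.

```stata
* Toy data: three ordinal variables coded 1..4, one value missing
clear
input byte(v1 v2 v3)
1 4 2
2 3 .
3 2 4
4 1 1
end

foreach v of varlist v1 v2 v3 {
    forvalues k = 2/4 {
        local j = `k' - 1
        * The -if- leaves the new variable missing wherever `v' is
        * missing; otherwise missing would count as >= `k'
        gen byte `v'_b`j' = `v' >= `k' if !missing(`v')
    }
}
list
```

For a variable with a different number of levels, the upper limit of
the -forvalues- range would change accordingly.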

Additional guidance and correction factors are discussed by
Anderberg.  I will not go into them here.

In my original response I indicated that if you found a
dissimilarity measure for ordinal data in the literature, you
could use Mata to produce that dissimilarity matrix for your data
and then pass it along to -clustermat- to do your cluster
analysis.  I want to illustrate that here.

Spath (1980, p. 31) indicates one possible distance function to
use for ordinal data (Spath cites Soergel, 1967 for this).

             sum(x_k) + sum(y_k) - 2*sum(min(x_k,y_k))
    d(x,y) = -----------------------------------------
             sum(x_k) + sum(y_k) - sum(min(x_k,y_k))

x and y are two observation row vectors, and k in the sum()s goes
from 1 to the number of ordinal variables.
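
Because min(x_k,y_k) + max(x_k,y_k) = x_k + y_k, the denominator equals
sum(max(x_k,y_k)) and the numerator equals sum(|x_k - y_k|), so for
positive codes d(x,y) lies between 0 and 1.  A small worked example in
Mata, using two made-up observation vectors, may make the arithmetic
concrete:

```stata
mata:
x = (1, 3, 5, 2)
y = (2, 3, 4, 4)

// Sum of the element-wise minima: 1 + 3 + 4 + 2 = 10
minsum = rowsum(colmin(x \ y))

// (11 + 13 - 2*10) / (11 + 13 - 10) = 4/14
d = (rowsum(x) + rowsum(y) - 2*minsum) / (rowsum(x) + rowsum(y) - minsum)
d
end
```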

Here is the Stata and Mata code implementing this distance
measure and performing a single-linkage cluster analysis on 4
ordinal variables each with 5 levels.

    // Get some data
    . sysuse auto, clear
    . keep in 11/30
    . xtile xlen = length, n(5)
    . xtile xturn = turn, n(5)
    . xtile xtrunk = trunk, n(5)

    // The variables rep78, xlen, xturn, and xtrunk are
    // ordinal with 5 levels

    // Use mata to produce the distance matrix
    . mata:
    : void function myorddist(string varlist, string Dmat)
    > {
    >     real matrix Dist
    >     real matrix Data
    > 
    >     V = st_varindex(tokens(varlist))
    >     Data = J(1,1,0)
    >     st_view(Data,.,V)
    >     Dist = J(rows(Data), rows(Data),0)
    >     rsum = rowsum(Data)
    >     for(i=1; i<=rows(Data); i++) {
    >         for(j=1; j<=i; j++){
    >             minsum = rowsum(colmin(Data[i,.]\Data[j,.]))
    >             Dist[i,j] = (rsum[i,1] + rsum[j,1] - 2 * minsum) :/ ///
    >                         (rsum[i,1] + rsum[j,1] - minsum)
    >             Dist[j,i] = Dist[i,j]
    >         }
    >     }
    >     st_matrix(Dmat, Dist)
    > }                       
    : end

    . mata: myorddist("rep78 xlen xturn xtrunk", "myD")

    // Show the upper corner of myD
    . matlist myD[1..5,1..5]

    // Do the cluster analysis
    . clustermat singlelink myD, add
    . cluster tree

I have not fully tested the Mata code, but I did do some spot
checking of several of the entries in myD and they agreed with my
hand calculation for the distance formula.  I did not add
functionality for dealing with -if- or -in- conditions for
selecting which observations to use etc.  Those extras would not
be hard to add.
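
One possible way to add -if- and -in- support is sketched below.  This
is an untested illustration, not part of the original code: the names
myorddist_sel() and -orddist- and the matrix() option are hypothetical.
It relies on the fourth argument of st_view(), which restricts the view
to observations where the named variable is nonzero, and on -marksample-
to build that variable.

```stata
mata:
// Same distance as myorddist(), restricted to observations with touse != 0
void function myorddist_sel(string scalar vars, string scalar touse, string scalar Dmat)
{
    real matrix    Data, Dist
    real colvector rsum
    real scalar    i, j, minsum

    st_view(Data, ., tokens(vars), touse)
    Dist = J(rows(Data), rows(Data), 0)
    rsum = rowsum(Data)
    for (i=1; i<=rows(Data); i++) {
        for (j=1; j<i; j++) {          // diagonal stays 0
            minsum = rowsum(colmin(Data[i,.] \ Data[j,.]))
            Dist[i,j] = (rsum[i] + rsum[j] - 2*minsum) / (rsum[i] + rsum[j] - minsum)
            Dist[j,i] = Dist[i,j]
        }
    }
    st_matrix(Dmat, Dist)
}
end

capture program drop orddist
program define orddist
    version 9
    syntax varlist(numeric) [if] [in], MATrix(name)
    marksample touse      // also excludes observations with missing values
    mata: myorddist_sel("`varlist'", "`touse'", "`matrix'")
end

// Example call on a subset of the auto data
sysuse auto, clear
xtile xlen = length, n(5)
orddist rep78 xlen if foreign, matrix(myD)
matlist myD[1..5,1..5]
```

The rows of the resulting matrix follow the selected observations in
dataset order.  Passing it to -clustermat- with -add- would still
require lining those rows up with the same observations, which this
sketch does not address.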

Hopefully this worked out example will help Bellinda (and others)
who want to perform cluster analysis on ordinal data using a
dissimilarity of their choosing that is not already a part of
official Stata.


References:

  Anderberg, Michael R. 1973. "Cluster Analysis For
      Applications". Academic Press.

  Soergel, D. 1967. "Mathematical analysis of documentation
      systems".  Inform.  Stor. Retr. 3:129-173.

  Spath, Helmuth. 1980. "Cluster Analysis Algorithms".  Ellis
      Horwood Ltd.


Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



