Re: st: Cluster analysis -similarity and dissimilarity measures

From: khigbee@stata.com
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Cluster analysis -similarity and dissimilarity measures
Date: Tue, 24 Oct 2006 10:57:18 -0500

```
Bellinda Kallimanis <bkallimanis@fmhi.usf.edu> asks:

> I have a question about the measure options for conducting hierarchical
> cluster analysis. I see that there are measures for continuous and
> binary variables, but what about ordinal variables? Is there a measure
> available in stata that I can use and just can't find? Or should I do
> some sort of standardization of the variables? The variables have 4
> categories.

Stata does not have built-in similarity/dissimilarity measures
designed for ordinal data for use in -cluster-.

Nick Cox <n.j.cox@durham.ac.uk> responded with some useful
guidance, including the suggestion of converting the data to
ranks or "ridits" (type -findit ridit- in Stata to find a
user-written -egen- function for ridits) and then (presumably)
using one of the continuous measures on these ranks or ridits.

Another approach is to turn your data into binary data and use
one of the binary measures.  (This is better justified if your
data are merely categorical rather than ordinal, though some
do it with ordinal data as well.)

One way to create the binary variables is with the -gen()- option
of the -tabulate- command.  Let's say I had three categorical
variables v1, v2, and v3, each containing values 1, 2, 3, and 4.
The following

quietly tabulate v1, gen(b_v1_)
quietly tabulate v2, gen(b_v2_)
quietly tabulate v3, gen(b_v3_)

creates variables

b_v1_1, b_v1_2, b_v1_3, b_v1_4,
b_v2_1, b_v2_2, b_v2_3, b_v2_4,
b_v3_1, b_v3_2, b_v3_3, and b_v3_4

You might want to put the -quietly tabulate- commands in a loop
if you have very many variables (for example 30 in the loop
below).

forvalues i = 1/30 {
    quietly tabulate v`i', gen(b_v`i'_)
}

After generating the binary variables you can then pass them to
-cluster-, specifying a binary measure such as -matching- in the
-measure()- option.

You can also pick some other binary measure besides -matching-.
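
* A minimal sketch of this step, assuming the indicator variables b_v*
* created above.  Single linkage and the cluster names are illustrative
* choices, not recommendations.
cluster singlelinkage b_v*, measure(matching) name(sl_match)

* The same variables with another binary measure, here Jaccard.
cluster singlelinkage b_v*, measure(Jaccard) name(sl_jac)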

If you do not want to use a continuous measure or a binary
measure, then find in the literature a measure designed for
ordinal data that you wish to use and create a similarity or
dissimilarity matrix for your data based on the formula for the
particular measure you found.

After you have come up with a similarity or dissimilarity matrix,
use -clustermat- to do the cluster analysis.  See Example 2 (page
85 of the Version 9 manual) of "[MV] clustermat" for an example
of doing something like this (it isn't an ordinal measure, but
instead a continuous measure not provided directly by Stata).
Also the FAQ

http://www.stata.com/support/faqs/mata/matsize.html

may be helpful to you.  You will find Stata's -mata- matrix
facilities (see -help mata-) to be the most flexible way of
producing your similarity or dissimilarity matrix.

Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC

```
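
The -clustermat- route described above might look roughly like the following. This is a sketch under stated assumptions, not a recommended analysis: the data are simulated (v1 to v3, coded 1 to 4), the city-block distance on the category codes is only a placeholder for whatever ordinal measure you take from the literature, and the names D and al_user are arbitrary. It also assumes a recent Stata in which runiform() is available.

```stata
* Hypothetical data: 20 observations on three ordinal variables coded 1-4.
clear
set seed 2006
set obs 20
forvalues i = 1/3 {
    generate byte v`i' = ceil(4*runiform())
}

* Build an n x n dissimilarity matrix in Mata.  The city-block distance on
* the category codes is a placeholder; substitute the formula you chose.
mata:
X = st_data(., ("v1", "v2", "v3"))
n = rows(X)
D = J(n, n, 0)
for (i = 2; i <= n; i++) {
    for (j = 1; j < i; j++) {
        D[i, j] = sum(abs(X[i, .] - X[j, .]))
        D[j, i] = D[i, j]
    }
}
st_matrix("D", D)
end

* Cluster on the user-supplied matrix; -add- attaches the result to the
* data in memory, which must have one observation per matrix row.
clustermat averagelinkage D, name(al_user) add
cluster list al_user
```

Only the inner statement of the Mata loop needs to change to swap in a different measure; the matrix must stay symmetric with zeros on the diagonal to be a sensible dissimilarity matrix.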