Jochen Siegele <[email protected]> asks:
> I am involved in preparing a cluster analysis for binary data
> with STATA. Based on the raw data set the similarity matrix for
> the objects is already available.
>
> Is there any option to feed STATA with a (dis-)similarity matrix
> directly, by preventing STATA from interpreting the input matrix
> as raw matrix and preventing the software from calculating a
> (dis-)similarity matrix, before the clustering is started?
>
> The problem has been that the raw data matrix is very large (> 1
> million attributes) and the similarity matrix has been calculated
> by routines outside of STATA. Now when it comes to clustering,
> STATA does not seem to interpret the input matrix as similarity
> matrix and calculates a further distance matrix, before the
> clustering itself is performed. Calculating additionally a
> distance matrix of a similarity matrix may result in
> methodological difficulties.
>
> Therefore my question: Do you have any idea how to deactivate the
> calculation of the distance matrix before the clustering is done?

You mention over 1 million attributes, but do not mention the
number of observations (subjects, companies, or whatever the
objects are). To make my response more concrete, let me say that
there are 200 companies for which you have measured 1 million
binary attributes. Outside of Stata you have created a 200 by 200
matrix of similarities between these 200 companies.

You have imported the 200 by 200 matrix into Stata. It is not
clear from your message whether you have this in a Stata matrix
(see help matrix) or whether it is sitting as 200 variables by
200 observations in the active Stata dataset. In the latter
case, see help mkmat to turn it into a Stata matrix.
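
As a rough sketch of that conversion, suppose the similarities are
stored in 200 variables named s1 through s200 (hypothetical names; use
whatever your import produced), with the 200 observations in the same
order as the variables:

```stata
* Assumption: variables s1-s200 hold the 200 x 200 similarities,
* one row per object, rows in the same order as the columns.
mkmat s1-s200, matrix(S)

* Confirm that the resulting matrix has the expected dimensions.
display rowsof(S) " x " colsof(S)
```

If Stata complains that the matrix is too large, see help matsize.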

Are you using Stata 8.2 or Stata 9?

In Stata 9 you can use the new -clustermat- command (see help
clustermat) to perform a cluster analysis on a dissimilarity
matrix.

You will first need to change your similarities to
dissimilarities. See, for instance, page 21 of

    Kaufman, L., and P.J. Rousseeuw. 1990. Finding Groups in Data:
    An Introduction to Cluster Analysis. Wiley: New York.

The formula given there is

    d(i,j) = 1 - s(i,j)

If I had a 200 by 200 matrix of similarities called S, I could
create a matrix of dissimilarities called D following the formula
above with

    . matrix D = J(200,200,1) - S

I could then pass D to -clustermat-. There is no need to
reexamine the 1 million attributes for each object.
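
Putting those two steps together might look something like the
following. This is only a sketch: it assumes S holds similarities
scaled between 0 and 1 with ones on the diagonal (so that D has a
zero diagonal), and both the choice of average linkage and the
5-group cut are arbitrary choices for illustration. The names avgclus
and grp are made up.

```stata
* Assumption: S is a symmetric 200 x 200 similarity matrix in [0,1].
matrix D = J(200,200,1) - S

* Cluster directly on the dissimilarity matrix; no raw data needed.
* The clear option replaces the data in memory with the cluster
* results, so save anything you need first.
clustermat averagelinkage D, name(avgclus) clear

* Cut the hierarchy into 5 groups (an arbitrary number) and tabulate.
cluster generate grp = groups(5), name(avgclus)
tabulate grp
```

Any of the other linkage methods listed in help clustermat could be
substituted for averagelinkage.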

Other transformations from similarities to dissimilarities could
be used. For instance, page 402 of

    Mardia, K.V., J.T. Kent, and J.M. Bibby. 1979. Multivariate
    Analysis. Academic Press.

in the context of multidimensional scaling (MDS), uses

    d(i,j) = sqrt(s(i,i) - 2*s(i,j) + s(j,j))

to transform from similarities to dissimilarities.
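
One way to apply this second formula in Stata is an element-by-element
loop, sketched below under the assumption that S is the 200 by 200
similarity matrix from before. The max(0, ...) guard is my own
addition, not part of the formula; it only keeps tiny negative values
caused by rounding from turning into missing values under sqrt().

```stata
* Assumption: S is a symmetric 200 x 200 similarity matrix.
matrix D = J(200,200,0)
forvalues i = 1/200 {
    forvalues j = 1/200 {
        * d(i,j) = sqrt(s(i,i) - 2*s(i,j) + s(j,j)), floored at 0
        matrix D[`i',`j'] = sqrt(max(0, S[`i',`i'] - 2*S[`i',`j'] + S[`j',`j']))
    }
}
```

The resulting D has zeros on the diagonal by construction and can be
passed to -clustermat- in the same way as before.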

The sources I have seen that discuss clustering similarities
first transform the similarities to dissimilarities and then
perform the cluster analysis.

Maybe I have misunderstood your question, but hopefully the
information above is helpful.

Ken Higbee [email protected]
StataCorp 1-800-STATAPC