
From: khigbee@stata.com
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Cluster analysis with STATA
Date: Fri, 17 Jun 2005 15:01:59 -0500

Jochen Siegele <jochen.siegele@web.de> asks:

> I am involved in preparing a cluster analysis for binary data
> with STATA. Based on the raw data set the similarity matrix for
> the objects is already available.
>
> Is there any option to feed STATA with a (dis-)similarity matrix
> directly, by preventing STATA from interpreting the input matrix
> as raw matrix and preventing the software from calculating a
> (dis-)similarity matrix, before the clustering is started?
>
> The problem has been that the raw data matrix is very large (> 1
> million attributes) and the similarity matrix has been calculated
> by routines outside of STATA. Now when it comes to clustering,
> STATA does not seem to interpret the input matrix as similarity
> matrix and calculates a further distance matrix, before the
> clustering itself is performed. Calculating additionally a
> distance matrix of a similarity matrix may result in
> methodological difficulties.
>
> Therefore my question: Do you have any idea how to deactivate the
> calculation of the distance matrix before the clustering is done?

You mention over 1 million attributes, but do not mention the number of observations (subjects, companies, or whatever). To make my response more concrete, let me say that there are 200 companies for which you have measured 1 million binary attributes. Outside of Stata you have created a 200 by 200 matrix of similarities between these 200 companies.

You have imported the 200 by 200 matrix into Stata. It is not clear from your message whether you have this in a Stata matrix (see help matrix) or whether it is sitting as 200 variables by 200 observations in the active Stata dataset. If it is the latter case, then see help mkmat to turn it into a Stata matrix.

Are you using Stata 8.2 or Stata 9? In Stata 9 you can use the new -clustermat- command (see help clustermat) to perform a cluster analysis on a dissimilarity matrix.

You will first need to change your similarities to dissimilarities.
See, for instance, page 21 of

    Kaufman, L., and P.J. Rousseeuw. 1990. Finding Groups in Data:
    An Introduction to Cluster Analysis. Wiley: New York.

The formula given there is

    d(i,j) = 1 - s(i,j)

If I had a 200 by 200 matrix of similarities called S, I could create a matrix of dissimilarities called D, following the formula above, with

    . matrix D = J(200,200,1) - S

I could then pass D to -clustermat-. There is no need to reexamine the 1 million attributes for each object.

Other transformations from similarities to dissimilarities could be used. For instance, page 402 of

    Mardia, K.V., J.T. Kent, and J.M. Bibby. 1979. Multivariate
    Analysis. Academic Press.

in the context of multidimensional scaling (MDS), uses

    d(i,j) = sqrt(s(i,i) - 2*s(i,j) + s(j,j))

to transform from similarities to dissimilarities.

The sources I have seen that discuss clustering similarities first transform the similarities to dissimilarities and then perform the cluster analysis.

Maybe I have misunderstood your question, but hopefully the information above is helpful.

Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC
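The steps described in the message can be put together roughly as follows. This is only a minimal sketch for Stata 9 or later, not a tested recipe: the variable names s1-s200, the cluster name simclus, the choice of average linkage, and the cut into 4 groups are all illustrative assumptions that do not come from the original message.

```stata
* Sketch only. Assumes the 200 x 200 similarities sit in the active
* dataset as variables s1-s200 (hypothetical names), one row per object.
mkmat s1-s200, matrix(S)

* Similarities to dissimilarities: d(i,j) = 1 - s(i,j)
matrix D = J(200,200,1) - S

* Alternative transformation (Mardia, Kent, and Bibby):
* d(i,j) = sqrt(s(i,i) - 2*s(i,j) + s(j,j))
matrix D2 = J(200,200,0)
forvalues i = 1/200 {
    forvalues j = 1/200 {
        matrix D2[`i',`j'] = sqrt(S[`i',`i'] - 2*S[`i',`j'] + S[`j',`j'])
    }
}

* Cluster directly on the dissimilarity matrix; no raw attributes needed.
* The linkage method is an assumption; -clear- drops the data in memory.
clustermat averagelinkage D, name(simclus) clear

* Cut the hierarchy into groups (the number 4 is arbitrary).
cluster generate grp = groups(4), name(simclus)
```

To cluster on the alternative dissimilarities, D2 would be passed to -clustermat- in place of D. See help clustermat for the available linkage methods and for the options that control whether the results replace or are added to the data in memory.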

