Matissa Hollister

statalist@hsphsun2.harvard.edu

st: Cluster, new dissimilarity measures, and sequence analysis

Wed, 21 Jul 2004

I’m hoping someone can help me solve this problem, although I’m beginning to think that it’s hopeless. I’ve created my own measure of dissimilarity that I want to use for clustering, but I can find no way to get Stata to accept it. Any ideas for getting around this problem would be greatly appreciated.

I am using a procedure called Optimal Matching, an algorithm designed to create a measure of dissimilarity between two sequences of data. I am using it to identify people who have similar career patterns. I’ve written a do-file that accomplishes the most difficult and unusual part of Optimal Matching, which is computing the dissimilarity between each pair of sequences. I now want to run a clustering procedure to identify groups based on this dissimilarity measure.

I found a post in the listserv archives (dated November 18, 2002) where someone wanted to do something similar (she wanted to create a geographic distance measure). From the response I gather that the dissimilarity algorithms are called and run inside the built-in Stata command _cluster, which is implemented in C, and modifying that is certainly beyond my programming abilities.
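For illustration only, here is a minimal Python sketch of an Optimal Matching style dissimilarity and of clustering directly from the resulting pairwise matrix. It is not the do-file described above. The fixed insertion/deletion and substitution costs, the single-letter state codes, and the function names om_distance, pairwise, and average_linkage are all assumptions chosen for the example; the clustering step is a naive average-linkage procedure, not Stata's.

```python
from itertools import combinations


def om_distance(a, b, indel=1.0, sub=2.0):
    """Cheapest cost of turning sequence a into b (assumed fixed costs)."""
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest cost of turning a[:i] into b[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + indel,      # delete from a
                          d[i][j - 1] + indel,      # insert into a
                          d[i - 1][j - 1] + cost)   # match or substitute
    return d[n][m]


def pairwise(seqs, **costs):
    """Symmetric matrix of dissimilarities between every pair of sequences."""
    k = len(seqs)
    dist = [[0.0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        dist[i][j] = dist[j][i] = om_distance(seqs[i], seqs[j], **costs)
    return dist


def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering that needs only the distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            total = sum(dist[i][j] for i in clusters[x] for j in clusters[y])
            avg = total / (len(clusters[x]) * len(clusters[y]))
            if best is None or avg < best[0]:
                best = (avg, x, y)
        _, x, y = best
        # x < y always holds, so deleting y does not shift x.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical careers: E = employed, U = unemployed, S = in school.
careers = ["EEEU", "EEEE", "SSSU", "SSSS"]
groups = average_linkage(pairwise(careers), n_clusters=2)
```

The point of the sketch is that the clustering step only ever reads the pairwise matrix, so any dissimilarity measure can be plugged in. The naive loop recomputes every between-cluster average at each merge, so it is only practical for small samples.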
I’ve contemplated several possibilities and would love help or advice on any of them:

1) Find a different software program (preferably not expensive) that will let me easily supply a new dissimilarity measure to a cluster command.
2) Find a way to alter Stata’s cluster command to allow for this new dissimilarity measure.
3) Find a way to get around this problem, e.g.:
   A. Use the ParseDist command within cluster.ado to somehow cause the built-in command to call a different distance command.
   B. Enter the data so that a built-in Stata dissimilarity measure results in the same pairwise distances. This is difficult because the pairwise dissimilarities make up a multi-dimensional space; the whole point is that they are difficult to summarize in a few variables.
4) Write my own clustering procedure.

Please! Any help would be gratefully accepted. I know that several other researchers have already used Optimal Matching with clustering, so my guess is that option #1 might be the most viable one, but I’m not sure where to look.

Thanks,
Matissa

