Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Dasinger, Lisa" <ldasinger@thezenith.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: cluster analysis validation |
Date | Thu, 19 Apr 2012 13:30:13 -0700 |
Brendan,

Thank you very much for the clarification on how to use the -ari- and -permtab- commands. I've been able to use them successfully (as well as carry out your initial suggestion). Thank you for all your help.

Lisa

Date: Tue, 17 Apr 2012 21:15:34 +0100
From: brendan.halpin@ul.ie (Brendan Halpin)
Subject: Re: st: cluster analysis validation

On Tue, Apr 17 2012, Dasinger, Lisa wrote:

> I've downloaded your programs, and I will try your suggested solution.
> What is the input for the ari command? I don't see a help file for it.
> Would it be the dataset variable (old vs. new) and cluster group
> variable (with values 1..6)? Also, I see that the permtab command seems
> to require a square matrix, but I would have a 6 x 2 matrix.

Both -ari- and -permtab- compare two classifications of the same size (number of categories). If your two cluster solutions are o6 and n6

    . ari o6 n6

will calculate the Adjusted Rand Index for the comparison, and

    . permtab o6 n6

will rearrange n6 so that it agrees as much as possible with o6, and then cross-tabulate them. If you do -permtab o6 n6, newvar(p6)- it will create in p6 a copy of the rearranged n6.

Both of those commands serve to compare the old and the joint classifications, if you wish to do that.

My initial suggestion (where the 6x2 table comes into it) is to compare the distribution of the two waves across the joint classification. In the simplest sense, this means examining the percentages within wave, but you could extend it to, say, a multinomial logistic regression (with the cluster solution as the dependent variable) and wave as one of the explanatory variables.

If you want more info on the Adjusted Rand Index, there are some notes in the form of comments in the ari.ado file -- its presence in the package was something of an afterthought, so I never set up a help file.
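A minimal illustrative sketch (not from the original post) of the within-wave comparison and the multinomial logit extension described above. The variable names are assumptions: j6 is taken to hold the joint six-cluster solution and wave to be 0 for the old data set and 1 for the new one.

```stata
* Assumed (hypothetical) variables: j6 = joint cluster solution (1..6),
* wave = 0 for old observations, 1 for new observations.

* Column percentages are percentages within wave; chi2 tests independence
tabulate j6 wave, column chi2

* Cluster membership as the outcome, wave as an explanatory variable
mlogit j6 i.wave
```

Other covariates could be added to the -mlogit- call alongside i.wave if the two waves differ in composition.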
ARI works in terms of pairs: if every possible pair of observations that has the same value in one classification has the same value in the other, and every pair that has different values in one has different values in the other, the agreement is perfect and ARI is 1.0. Otherwise the index is less than one.

Regards,

Brendan

PS: Note that since permtab permutes one of the classifications, its runtime rises factorially, to the extent that it is useless from about 10 categories up. -permtabga- estimates an approximate solution for larger classifications.

--
Brendan Halpin, Department of Sociology, University of Limerick, Ireland
Tel: w +353-61-213147 f +353-61-202569 h +353-61-338562; Room F1-009 x 3147
mailto:brendan.halpin@ul.ie
ULSociology on Facebook: http://on.fb.me/fjIK9t
http://teaching.sociology.ul.ie/bhalpin/wordpress
twitter:@ULSociology

Date: Tue, 17 Apr 2012 11:36:07 -0700
From: "Dasinger, Lisa" <ldasinger@thezenith.com>
Subject: Re: st: cluster analysis validation

Thank you, Brendan,

I've downloaded your programs, and I will try your suggested solution. What is the input for the ari command? I don't see a help file for it. Would it be the dataset variable (old vs. new) and cluster group variable (with values 1..6)? Also, I see that the permtab command seems to require a square matrix, but I would have a 6 x 2 matrix.

Lisa

Date: Mon, 16 Apr 2012 15:57:30 -0700
From: "Dasinger, Lisa" <ldasinger@thezenith.com>
Subject: st: cluster analysis validation

I've run a cluster analysis in Stata 11.2 based on three continuous variables using -cluster wardslinkage- with the default similarity/dissimilarity measure to generate 6 groups. I'd like to know if there is a way to apply the same cluster analysis to a new set of data. In other words, is there a way to run a new dataset through the old cluster analysis and see how new observations are classified, akin to running a regression equation and then taking a new dataset to obtain out of sample predictions?
If so, is there a way to evaluate how well the "old" analysis fits the new data, e.g., by determining how similar/dissimilar each new observation is to the observations in the cluster in which each is placed, and whether the new observation is placed in the "best" cluster, meaning the one that minimizes the distance between the observation and the "center" of the existing cluster? I am new to cluster analysis and am looking for a way to validate the "old" cluster analysis.

Lisa

Lisa Dasinger, Ph.D.
Data Reporting Manager
Claims Analytics
Zenith Insurance Company
Pleasanton Regional Office
4309 Hacienda Drive, Suite 200
Pleasanton, CA 94588
ldasinger@thezenith.com
www.TheZenith.com

------------------------------

Date: Tue, 17 Apr 2012 00:54:03 +0100
From: brendan.halpin@ul.ie (Brendan Halpin)
Subject: Re: st: cluster analysis validation

On Mon, Apr 16 2012, Dasinger, Lisa wrote:

> I've run a cluster analysis in Stata 11.2 based on three continuous
> variables using -cluster wardslinkage- with the default
> similarity/dissimilarity measure to generate 6 groups. I'd like to know
> if there is a way to apply the same cluster analysis to a new set of
> data. In other words, is there a way to run a new dataset through the
> old cluster analysis and see how new observations are classified, akin
> to running a regression equation and then taking a new dataset to obtain
> out of sample predictions?

I would suggest pooling the two data sets, running a new cluster analysis, and analysing the resultant 6*2 table (cluster classification by old/new). That would test the extent to which the two data sets are similarly distributed across a joint classification. If that's acceptable (and the combined data set is small enough) it is a clean and easy solution.

If you are concerned that the joint classification is not compatible with the old classification, you can compare the old cluster membership with the joint cluster membership, for the old data.
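A minimal illustrative sketch (not from the original post) of the pooling approach just described. The file names old.dta and new.dta, the clustering variables x1 x2 x3, and the variable names wave, j6 and o6 are all assumptions; o6 is taken to be the original six-group solution already saved in old.dta, and new.dta is assumed not to contain a variable called wave.

```stata
* Assumed (hypothetical) names: old.dta, new.dta, x1 x2 x3, wave, j6, o6
use old, clear
generate byte wave = 0                  // mark the old observations
append using new
replace wave = 1 if missing(wave)       // appended rows are the new observations

* Joint Ward's-linkage solution on the pooled data, cut at 6 groups
cluster wardslinkage x1 x2 x3, name(joint)
cluster generate j6 = groups(6), name(joint)

* The 6 x 2 table: joint classification by old/new
tabulate j6 wave, column

* Old vs. joint membership, restricted to the old data
* (-ari- comes from the sadi package mentioned in the thread)
preserve
keep if wave == 0
ari o6 j6
restore
```

The -preserve-/-keep-/-restore- sequence is used only because the thread does not say whether -ari- accepts an -if- qualifier.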
A good measure of agreement is the Adjusted Rand Index. Comparing cluster solutions is tricky because they don't have "labels" -- there is no way of saying that a given group in classification A is the same as any particular group in classification B, apart from having shared membership. The ARI takes that into account.

In theory you can relate the new data to the old classification by calculating the distance from each new observation to the old cluster centroids, but I don't know an easy way of doing that with Stata.

Regards,

Brendan

PS: I have code to estimate the ARI, and to re-arrange cluster solutions to maximise similarity. If you are interested, check out:

    net from http://teaching.sociology.ul.ie/sadi
    net install sadi

and look at the -ari- and -permtab- commands.
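A minimal illustrative sketch (not from the original thread) of the centroid-distance idea raised in the discussion: assign each new observation to the nearest old cluster centroid. This is plain nearest-centroid assignment, not a re-run of Ward's linkage. All names are assumptions: old.dta is taken to hold x1 x2 x3 and o6 (old clusters coded 1..6, all non-empty), new.dta the same x1 x2 x3, and Euclidean distance on the unstandardized variables is assumed to be appropriate. The code is meant to be run from a do-file.

```stata
* Assumed (hypothetical) inputs: old.dta with x1 x2 x3 o6; new.dta with x1 x2 x3
use old, clear
collapse (mean) x1 x2 x3, by(o6)        // one row per old cluster, sorted by o6
forvalues k = 1/6 {
    forvalues v = 1/3 {
        scalar c`k'_`v' = x`v'[`k']     // centroid of cluster k on variable v
    }
}

use new, clear
generate double dmin = .                // distance to the nearest centroid
generate byte nearest = .               // index of the nearest old cluster
forvalues k = 1/6 {
    generate double d`k' = sqrt((x1 - scalar(c`k'_1))^2 + ///
        (x2 - scalar(c`k'_2))^2 + (x3 - scalar(c`k'_3))^2)
    * missing sorts above every number, so the first pass always updates
    replace nearest = `k' if d`k' < dmin
    replace dmin = d`k' if d`k' < dmin
}

tabulate nearest
summarize dmin, detail
```

The distances d1 to d6 are kept so that each new observation's fit to every old cluster can be inspected, and dmin could be compared with the within-cluster distances in the old data as a rough check on fit.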