Re: st: cluster analysis validation


From   brendan.halpin@ul.ie (Brendan Halpin)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: cluster analysis validation
Date   Tue, 17 Apr 2012 00:54:03 +0100

On Mon, Apr 16 2012, Dasinger, Lisa wrote:

> I've run a cluster analysis in Stata 11.2 based on three continuous
> variables using -cluster wardslinkage- with the default
> similarity/dissimilarity measure to generate 6 groups.  I'd like to know
> if there is a way to apply the same cluster analysis to a new set of
> data.  In other words, is there a way to run a new dataset through the
> old cluster analysis and see how new observations are classified, akin
> to running a regression equation and then taking a new dataset to obtain
> out of sample predictions?  

I would suggest pooling the two data sets, running a new cluster
analysis, and analysing the resultant 6*2 table (cluster classification
by old/new). That would test the extent to which the two data sets are
similarly distributed across a joint classification. If that's
acceptable (and the combined data set is small enough) it is a clean and
easy solution.
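
A minimal sketch of the pooled approach, under stated assumptions: the variable names x1-x3, the indicator -sample- (0 = old, 1 = new) and the cluster name -pooled- are all hypothetical, and the data are simulated only so that the example runs on its own. With real data you would replace the simulation with -use- on the old file followed by -append using- on the new file with the generate(sample) option.

```stata
* Simulated stand-in for the pooled data (hypothetical names throughout)
clear
set seed 12345
set obs 200
gen byte sample = _n > 120            // 0 = old data, 1 = new data
forvalues j = 1/3 {
    gen double x`j' = rnormal()       // three continuous variables
}

* Ward's linkage on the pooled data, default (squared Euclidean) measure
cluster wardslinkage x1 x2 x3, name(pooled)
cluster generate g6 = groups(6), name(pooled)

* The 6*2 table: joint cluster membership by old/new
tabulate g6 sample, column chi2
```

If the column percentages differ sharply between old and new, the two data sets are not similarly distributed across the joint classification.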

If you are concerned that the joint classification is not compatible
with the old classification, you can compare the old cluster membership
with the joint cluster membership, for the old data. A good measure of
agreement is the Adjusted Rand Index. Comparing cluster solutions is
tricky because they don't have "labels" -- there is no way of saying
that a given group in classification A is the same as any particular
group in classification B, apart from having shared membership. The ARI
takes that into account.
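
For illustration, the ARI can be computed by hand from pair counts. This is a sketch, not the -ari- command mentioned in the PS; the variables -old6- and -joint6- are hypothetical names for the old and joint memberships of the old observations, the toy data are made up, and the code assumes neither variable has missing values. It uses the fact that summing (n-1)/2 over the observations in a group of size n gives n(n-1)/2, the number of pairs in that group.

```stata
* Toy data: joint6 is just a relabelling of old6
clear
input byte old6 byte joint6
1 2
1 2
2 3
2 3
3 1
3 1
end

* Per-observation contributions to the pair counts
bysort old6 joint6: gen double p_ab = (_N - 1)/2   // pairs together in both
bysort old6:        gen double p_a  = (_N - 1)/2   // pairs together in old6
bysort joint6:      gen double p_b  = (_N - 1)/2   // pairs together in joint6

quietly summarize p_ab
scalar s_ab = r(sum)
quietly summarize p_a
scalar s_a = r(sum)
quietly summarize p_b
scalar s_b = r(sum)

scalar npairs   = _N*(_N - 1)/2          // all pairs of observations
scalar expected = s_a*s_b/npairs         // agreement expected by chance
scalar ari = (s_ab - expected)/((s_a + s_b)/2 - expected)
display "Adjusted Rand Index = " ari
```

Because only shared membership enters the calculation, swapping the group numbers in either variable leaves the result unchanged.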

In theory you can relate the new data to the old classification by
calculating the distance from each new observation to the old cluster
centroids, but I don't know an easy way of doing that with Stata.
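
One possible way to do it, offered as a sketch rather than a tested recipe: loop over the old clusters, take each cluster's means from the old observations, and keep the cluster with the smallest squared Euclidean distance (the default measure for -cluster wardslinkage-). The names x1-x3, -new-, -old6-, -nearest- and -dmin- are hypothetical and the data are simulated so the example is self-contained. Note that nearest-centroid assignment is not the same as Ward's agglomeration, so even old observations need not be closest to their own cluster's centroid.

```stata
* Simulated stand-in: first 100 observations are old, last 50 are new
clear
set seed 2012
set obs 150
gen byte new = _n > 100
forvalues j = 1/3 {
    gen double x`j' = rnormal()
}

* Old classification, built from the old observations only
cluster wardslinkage x1 x2 x3 if new == 0, name(oldcl)
cluster generate old6 = groups(6), name(oldcl)     // missing for new obs

* Squared distance from every observation to each old centroid
gen byte   nearest = .
gen double dmin    = .
forvalues k = 1/6 {
    gen double d = 0
    foreach v of varlist x1 x2 x3 {
        quietly summarize `v' if old6 == `k'       // centroid coordinate
        quietly replace d = d + (`v' - r(mean))^2
    }
    * missing counts as larger than any number, so the first pass always updates
    quietly replace nearest = `k' if d < dmin
    quietly replace dmin    = d   if d < dmin
    drop d
}

tabulate nearest if new == 1
```

If the variables are on very different scales, standardising them (for the clustering and the distances alike) would be a sensible variation.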


Regards,

Brendan


PS: I have code to estimate the ARI, and to re-arrange cluster solutions
to maximise similarity. If you are interested, check out:

   net from http://teaching.sociology.ul.ie/sadi
   net install sadi

and look at the -ari- and -permtab- commands.
-- 
Brendan Halpin,   Department of Sociology,   University of Limerick,   Ireland
Tel: w +353-61-213147  f +353-61-202569  h +353-61-338562;  Room F1-009 x 3147
mailto:brendan.halpin@ul.ie    ULSociology on Facebook: http://on.fb.me/fjIK9t
http://teaching.sociology.ul.ie/bhalpin/wordpress         twitter:@ULSociology