Re: st: cluster analysis validation

From	"Dasinger, Lisa" <[email protected]>
To	[email protected]
Subject	Re: st: cluster analysis validation
Date	Tue, 17 Apr 2012 11:36:07 -0700

Thank you, Brendan,

I've downloaded your programs, and I will try your suggested solution.
What is the input for the -ari- command?  I don't see a help file for it.
Would it be the dataset variable (old vs. new) and the cluster group
variable (with values 1..6)?  Also, I see that the -permtab- command seems
to require a square matrix, but I would have a 6 x 2 matrix.

Lisa

Date: Mon, 16 Apr 2012 15:57:30 -0700
From: "Dasinger, Lisa" <[email protected]>
Subject: st: cluster analysis validation

I've run a cluster analysis in Stata 11.2 based on three continuous
variables using -cluster wardslinkage- with the default
similarity/dissimilarity measure to generate 6 groups.  I'd like to know
if there is a way to apply the same cluster analysis to a new set of
data.  In other words, is there a way to run a new dataset through the
old cluster analysis and see how new observations are classified, akin
to running a regression equation and then taking a new dataset to obtain
out of sample predictions?  

If so, is there a way to evaluate how well the "old" analysis fits the
new data, e.g., by determining how similar/dissimilar each new
observation is to the observations in the cluster in which each is
placed, and whether the new observation is placed in the "best" cluster,
meaning the one that minimizes the distance between the observation and
the "center" of the existing cluster?

I am new to cluster analysis and am looking for a way to validate the
"old" cluster analysis.  

Lisa

Lisa Dasinger, Ph.D.

Data Reporting Manager
Claims Analytics

Zenith Insurance Company
Pleasanton Regional Office
4309 Hacienda Drive, Suite 200
Pleasanton, CA 94588

[email protected]

www.TheZenith.com

------------------------------

Date: Tue, 17 Apr 2012 00:54:03 +0100
From: [email protected] (Brendan Halpin)
Subject: Re: st: cluster analysis validation

On Mon, Apr 16 2012, Dasinger, Lisa wrote:

> I've run a cluster analysis in Stata 11.2 based on three continuous
> variables using -cluster wardslinkage- with the default
> similarity/dissimilarity measure to generate 6 groups.  I'd like to know
> if there is a way to apply the same cluster analysis to a new set of
> data.  In other words, is there a way to run a new dataset through the
> old cluster analysis and see how new observations are classified, akin
> to running a regression equation and then taking a new dataset to obtain
> out of sample predictions?

I would suggest pooling the two data sets, running a new cluster
analysis, and analysing the resultant 6*2 table (cluster classification
by old/new). That would test the extent to which the two data sets are
similarly distributed across a joint classification. If that's
acceptable (and the combined data set is small enough) it is a clean and
easy solution.
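
A minimal sketch of this pooled approach. The file names old.dta and
new.dta and the variable names x1, x2, x3, sample and g_joint are
hypothetical, so substitute your own:

```stata
* Pool the old and new data, recording where each observation came from
use old, clear
generate byte sample = 0
append using new
replace sample = 1 if missing(sample)
label define sample 0 "old" 1 "new"
label values sample sample

* Re-run Ward's linkage on the pooled data and cut the tree at 6 groups
cluster wardslinkage x1 x2 x3, name(joint)
cluster generate g_joint = groups(6), name(joint)

* 6 x 2 table: joint cluster by old/new, with column percentages
tabulate g_joint sample, column chi2
```

If the column percentages are similar for old and new, the two data
sets are spread across the joint clusters in much the same way.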

If you are concerned that the joint classification is not compatible
with the old classification, you can compare the old cluster membership
with the joint cluster membership, for the old data. A good measure of
agreement is the Adjusted Rand Index. Comparing cluster solutions is
tricky because they don't have "labels" -- there is no way of saying
that a given group in classification A is the same as any particular
group in classification B, apart from having shared membership. The ARI
takes that into account.
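
For reference, the ARI can also be worked out by hand from the
cross-tabulation of the two classifications. The sketch below is not
the -ari- command from the sadi package. It assumes the pooled data are
in memory, that g_old (hypothetical name) holds the original cluster
membership for the old observations, that g_joint holds the pooled
solution, and that sample == 0 marks the old data:

```stata
* Adjusted Rand Index between g_old and g_joint, old observations only
preserve
keep if sample == 0
contract g_old g_joint, freq(nij)

* Row and column totals of the cross-tabulation
egen double ai = total(nij), by(g_old)
egen double bj = total(nij), by(g_joint)
egen byte first_i = tag(g_old)
egen byte first_j = tag(g_joint)

* Number of pairs within each cell, each row and each column
generate double pairs_ij = nij*(nij - 1)/2
generate double pairs_i  = first_i*ai*(ai - 1)/2
generate double pairs_j  = first_j*bj*(bj - 1)/2

* Scalars are prefixed sc_ so they cannot clash with variable names
quietly summarize nij
scalar sc_n = r(sum)
quietly summarize pairs_ij
scalar sc_ij = r(sum)
quietly summarize pairs_i
scalar sc_i = r(sum)
quietly summarize pairs_j
scalar sc_j = r(sum)

* Pair agreement expected by chance, then the adjusted index
scalar sc_exp = scalar(sc_i)*scalar(sc_j)/(scalar(sc_n)*(scalar(sc_n) - 1)/2)
scalar sc_ari = (scalar(sc_ij) - scalar(sc_exp)) / ///
    ((scalar(sc_i) + scalar(sc_j))/2 - scalar(sc_exp))
display "Adjusted Rand Index = " %6.4f scalar(sc_ari)
restore
```

A value of 1 means the two classifications group the old observations
identically (whatever the group numbers are), and values near 0 mean
agreement no better than chance.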

In theory you can relate the new data to the old classification by
calculating the distance from each new observation to the old cluster
centroids, but I don't know an easy way of doing that with Stata.
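
One possible way to do this by hand, offered as a rough sketch rather
than a tested recipe. It assumes the old data carry the original
membership in g_old (coded 1 to 6), that both files hold the clustering
variables under the hypothetical names x1, x2 and x3, and that the
variables are on the same scale as in the original analysis. Note that
assigning by nearest centroid is not the same procedure as Ward's
linkage itself:

```stata
* Store the centroid (mean of x1-x3) of each old cluster in local macros
use old, clear
forvalues k = 1/6 {
    foreach v in x1 x2 x3 {
        quietly summarize `v' if g_old == `k'
        local m_`v'_`k' = r(mean)
    }
}

* For each new observation, find the nearest old centroid
use new, clear
generate double d_min = .
generate byte g_nearest = .
forvalues k = 1/6 {
    * Euclidean distance from each observation to centroid k
    generate double d`k' = sqrt((x1 - `m_x1_`k'')^2 + ///
        (x2 - `m_x2_`k'')^2 + (x3 - `m_x3_`k'')^2)
    * Missing sorts above every number, so the first pass always updates
    replace g_nearest = `k' if d`k' < d_min
    replace d_min = d`k' if d`k' < d_min
}

* How the new observations fall across the old clusters, and how far
* each one is from its nearest centroid
tabulate g_nearest
summarize d_min, detail
```

The variables d1 to d6 keep the distance to every centroid, so you can
see how clearly each new observation belongs to its nearest cluster.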

Regards,

Brendan

PS: I have code to estimate the ARI, and to re-arrange cluster solutions
to maximise similarity. If you are interested, check out:

   net from http://teaching.sociology.ul.ie/sadi
   net install sadi

and look at the -ari- and -permtab- commands.
-- 
Brendan Halpin,   Department of Sociology,   University of Limerick, Ireland
Tel: w +353-61-213147  f +353-61-202569  h +353-61-338562;  Room F1-009 x 3147
mailto:[email protected]    ULSociology on Facebook: http://on.fb.me/fjIK9t
http://teaching.sociology.ul.ie/bhalpin/wordpress
twitter:@ULSociology
