help cluster
-------------------------------------------------------------------------------
Title
[MV] cluster -- Introduction to cluster-analysis commands
Syntax
Cluster analysis of data
cluster subcommand ...
Cluster analysis of a dissimilarity matrix
clustermat subcommand ...
Description
Stata's cluster-analysis routines provide several hierarchical and
partition clustering methods, postclustering summarization methods, and
cluster-management tools. This entry presents an overview of the cluster
and clustermat commands (also see [MV] clustermat), as well as Stata's
cluster-analysis management tools. The hierarchical clustering methods
may be applied to the data by using the cluster command or to a
user-supplied dissimilarity matrix by using the clustermat command.
The cluster command has the following subcommands, which are detailed in
their respective manual entries.
Topic                                 cluster subcommand
-----------------------------------------------------------------
Partition-clustering methods for observations
  (see [MV] cluster kmeans and kmedians)
    Kmeans                            cluster kmeans
    Kmedians                          cluster kmedians

Hierarchical clustering methods for observations
  (see [MV] cluster linkage)
    Single linkage                    cluster singlelinkage
    Average linkage                   cluster averagelinkage
    Complete linkage                  cluster completelinkage
    Weighted-average linkage          cluster waveragelinkage
    Median linkage                    cluster medianlinkage
    Centroid linkage                  cluster centroidlinkage
    Ward's linkage                    cluster wardslinkage

Postclustering commands
    Stopping rules                    cluster stop
    Dendrograms (cluster trees)       cluster dendrogram
                                        (synonym: cluster tree)
    Generate summary variables        cluster generate

User utilities
    Cluster notes                     cluster notes
    Other user utilities
      (see [MV] cluster utility)
                                      cluster dir
                                      cluster list
                                      cluster drop
                                      cluster use
                                      cluster rename
                                      cluster renamevar

Programmer utilities
  (see [MV] cluster programming utilities)
                                      cluster query
                                      cluster set
                                      cluster delete
                                      cluster parsedistance
                                      cluster measures

(Dis)similarity measures
  (see [MV] measure_option)
-----------------------------------------------------------------
The clustermat command has the following subcommands, which are detailed
along with the related cluster command in the cluster linkage help file.
Also see [MV] clustermat.
Topic                                 clustermat subcommand
-----------------------------------------------------------------
Hierarchical clustering of a dissimilarity matrix
  (see [MV] cluster linkage)
    Single linkage                    clustermat singlelinkage
    Complete linkage                  clustermat completelinkage
    Average linkage                   clustermat averagelinkage
    Weighted-average linkage          clustermat waveragelinkage
    Median linkage                    clustermat medianlinkage
    Centroid linkage                  clustermat centroidlinkage
    Ward's linkage                    clustermat wardslinkage
-----------------------------------------------------------------
Partition-clustering methods for observations
(see [MV] cluster kmeans and kmedians; help cluster kmeans and kmedians)
Kmeans cluster analysis
cluster kmeans [varlist] [if] [in] , k(#) [ options ]
Kmedians cluster analysis
cluster kmedians [varlist] [if] [in] , k(#) [ options ]
options                   description
-------------------------------------------------------------------------
Main
* k(#)                    perform cluster analysis resulting in # groups
  measure(measure)        similarity or dissimilarity measure; default is
                            L2 (Euclidean)
  name(clname)            name of resulting cluster analysis

Options
  start(start_option)     obtain k initial group centers by using
                            start_option; see [MV] cluster kmeans and
                            kmedians for details
  keepcenters             append the k final group means or medians to the
                            data

Advanced
  generate(groupvar)      name of grouping variable
  iterate(#)              maximum number of iterations; default is
                            iterate(10000)
-------------------------------------------------------------------------
* k(#) is required.
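
Example (a minimal sketch, not part of the original entry): the two partition commands above applied to the auto dataset shipped with Stata. The dataset, the variable list, the analysis names km3 and kmed3, and the seed are arbitrary choices made for illustration.

```stata
* Load an example dataset (assumed choice; any numeric variables would do)
sysuse auto, clear

* Kmeans into 3 groups; krandom(#) picks random starting centers with a fixed seed
cluster kmeans price mpg weight, k(3) name(km3) start(krandom(12345))

* Kmedians into 3 groups, using absolute-value (L1) distance
cluster kmedians price mpg weight, k(3) measure(L1) name(kmed3) start(krandom(12345))

* With generate() omitted, each grouping variable takes the analysis name
tabulate km3 kmed3
```

The variables here are on very different scales, so in practice they would usually be standardized before clustering.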
Hierarchical clustering for observations
(see [MV] cluster linkage; help cluster linkage)
cluster linkage [varlist] [if] [in] [, cluster_options ]
linkage                   description
-------------------------------------------------------------------------
singlelinkage             single-linkage cluster analysis
averagelinkage            average-linkage cluster analysis
completelinkage           complete-linkage cluster analysis
waveragelinkage           weighted-average linkage cluster analysis
medianlinkage             median-linkage cluster analysis
centroidlinkage           centroid-linkage cluster analysis
wardslinkage              Ward's linkage cluster analysis
-------------------------------------------------------------------------
cluster_options           description
-------------------------------------------------------------------------
Main
  measure(measure)        similarity or dissimilarity measure
  name(clname)            name of resulting cluster analysis

Advanced
  generate(stub)          prefix for generated variables; default prefix
                            is clname
-------------------------------------------------------------------------
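
Example (a minimal sketch, not part of the original entry): two hierarchical analyses of the same variables. The auto dataset, the analysis names ward and avg, and the stub av are assumptions made for illustration.

```stata
* Load an example dataset (assumed choice)
sysuse auto, clear

* Ward's linkage with the default measure; variables get the prefix "ward"
cluster wardslinkage price mpg weight, name(ward)

* Average linkage with L1 distance; generate() overrides the variable prefix
cluster averagelinkage price mpg weight, measure(L1) name(avg) generate(av)

* Show what was stored for each analysis
cluster list ward avg
```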
Postclustering commands
Stopping rules (see [MV] cluster stop;
help cluster stop)
Cluster stopping rules for hierarchical clustering are provided with the
cluster stop command.
Cluster analysis of data
cluster stop [clname] [, options]
Cluster analysis of a dissimilarity matrix
clustermat stop [clname] , variables(varlist) [options]
options                   description
-------------------------------------------------------------------------
  rule(calinski)          use Calinski/Harabasz pseudo-F index stopping
                            rule; the default
  rule(duda)              use Duda/Hart Je(2)/Je(1) index stopping rule
* rule(rule_name)         use rule_name stopping rule
  groups(numlist)         compute stopping rule for specified groups
  matrix(matname)         save the results in matrix matname
+ variables(varlist)      compute the stopping rule using varlist
-------------------------------------------------------------------------
* rule(rule_name) is not shown in the dialog box. See [MV] cluster
programming subroutines for information on how to add stopping rules to
the cluster stop command.
+ variables(varlist) is required with a clustermat solution and optional
with a cluster solution.
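
Example (a minimal sketch, not part of the original entry): both built-in stopping rules applied to one hierarchical analysis. The auto dataset, the analysis name ward, and the matrix name dh are assumptions made for illustration.

```stata
* Load an example dataset and run a hierarchical analysis (assumed choices)
sysuse auto, clear
cluster wardslinkage price mpg weight, name(ward)

* Calinski/Harabasz pseudo-F for 2 through 8 groups (the default rule)
cluster stop ward, rule(calinski) groups(2/8)

* Duda/Hart index for 1 through 8 groups, also saved in matrix dh
cluster stop ward, rule(duda) groups(1/8) matrix(dh)
matrix list dh
```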
Dendrograms (see [MV] cluster dendrogram;
help cluster dendrogram)
cluster dendrogram [clname] [if] [in] [, options ]
options                   description
-------------------------------------------------------------------------
Main
  quick                   do not center parent branches
  labels(varname)         name of variable containing leaf labels
  cutnumber(#)            display top # branches only
  cutvalue(#)             display branches above # (dis)similarity
                            measure only
  showcount               display number of observations for each branch
  countprefix(string)     prefix the branch count with string; default is
                            "n="
  countsuffix(string)     suffix the branch count with string; default is
                            empty string
  countinline             put branch count inline with branch label
  vertical                orient dendrogram vertically (default)
  horizontal              orient dendrogram horizontally

Plot
  line_options            affect rendition of the plotted lines

Add plots
  addplot(plot)           add other plots to the dendrogram

Y axis, X axis, Titles, Legend, Overall
  twoway_options          any option other than by() documented in
                            [G] twoway_options
-------------------------------------------------------------------------
Note: cluster tree is a synonym for cluster dendrogram.
In addition to the restrictions imposed by if and in, the observations
are automatically restricted to those that were used in the cluster
analysis.
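
Example (a minimal sketch, not part of the original entry): a full tree with leaf labels, then a trimmed tree with branch counts. The auto dataset, the restriction to foreign cars, and the analysis name comp are assumptions made for illustration.

```stata
* Cluster only the foreign cars (assumed choice, to keep the tree small)
sysuse auto, clear
cluster completelinkage price mpg weight if foreign, name(comp)

* Full tree labeled by make; the plot is limited to the clustered observations
cluster dendrogram comp, labels(make) xlabel(, angle(90) labsize(*.75))

* Top 5 branches only, drawn horizontally, with observation counts
cluster dendrogram comp, cutnumber(5) showcount horizontal
```

The xlabel() suboptions are ordinary twoway_options and only keep the leaf labels legible.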
Generate summary variables (see [MV] cluster generate;
help cluster generate)
The cluster generate command generates summary or grouping variables
after a cluster analysis. The groups() function generates variables
indicating cluster membership into the specified number(s) of clusters
after a hierarchical cluster analysis. The cut() function generates a
variable indicating cluster membership based on cutting the dendrogram at
the specified (dis)similarity value.
Generate grouping variables for specified numbers of clusters
cluster generate { newvar | stub } = groups(numlist) [, options ]
Generate grouping variable by cutting the dendrogram
cluster generate newvar = cut(#) [, name(clname) ]
options                   description
-------------------------------------------------------------------------
  name(clname)            name of cluster analysis to use in producing new
                            variables
  ties(error)             produce error message for ties; default
  ties(skip)              ignore requests that result in ties
  ties(fewer)             produce results for largest number of groups
                            smaller than your request
  ties(more)              produce results for smallest number of groups
                            larger than your request
-------------------------------------------------------------------------
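
Example (a minimal sketch, not part of the original entry): both forms of cluster generate after one hierarchical analysis. The auto dataset and the names ward, grp, and cutgrp are assumptions made for illustration; the cut height is simply the mean of the stored heights, chosen so that the value falls inside the tree.

```stata
* Load an example dataset and run a hierarchical analysis (assumed choices)
sysuse auto, clear
cluster wardslinkage price mpg weight, name(ward)

* One grouping variable per requested number of groups: grp3, grp4, grp5
cluster generate grp = groups(3/5), name(ward) ties(fewer)

* Cut the dendrogram at a chosen dissimilarity value
summarize ward_hgt, meanonly
cluster generate cutgrp = cut(`r(mean)'), name(ward)

tabulate grp3 cutgrp
```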
User utilities
Cluster notes (see [MV] cluster notes;
help cluster notes)
The cluster notes command provides the ability to add, view, and delete
notes for a cluster analysis.
Add a note to a cluster analysis
cluster notes clname : text
List all cluster notes
cluster notes
List cluster notes associated with specified cluster analysis
cluster notes clnamelist
Drop cluster notes
cluster notes drop clname [in numlist]
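
Example (a minimal sketch, not part of the original entry): adding, listing, and dropping notes. The auto dataset, the analysis name sl, and the note text are assumptions made for illustration.

```stata
* Load an example dataset and run an analysis to attach notes to (assumed choices)
sysuse auto, clear
cluster singlelinkage price mpg weight, name(sl)

* Add two notes to the analysis
cluster notes sl : variables were not standardized
cluster notes sl : compare with complete linkage

* List the notes for sl, drop the first one, then list all notes
cluster notes sl
cluster notes drop sl in 1
cluster notes
```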
User utilities (see [MV] cluster utility;
help cluster utility)
The cluster user utility subcommands allow you to view and manipulate cluster
objects. cluster dir gives a directory-style listing of the currently
defined clusters. cluster list gives a detailed listing of clusters.
cluster drop removes the named clusters. cluster use marks a cluster
analysis as the most recent one. cluster rename allows the renaming of a
cluster analysis. cluster renamevar properly renames variables attached
to a cluster analysis.
Directory-style listing of currently defined clusters
cluster dir
Detailed listing of clusters
cluster list [clnamelist] [, list_options ]
Drop the named clusters
cluster drop { clnamelist | _all }
Mark a cluster analysis as the most recent one
cluster use clname
Rename a cluster
cluster rename oldclname newclname
Rename variables attached to a cluster
cluster renamevar oldvarname newvarname [, name(clname) ]
cluster renamevar oldstub newstub , prefix [ name(clname) ]
list_options              description
-------------------------------------------------------------------------
Options
  notes                   list cluster notes
  type                    list cluster analysis type
  method                  list cluster analysis method
  dissimilarity           list cluster analysis dissimilarity measure
  similarity              list cluster analysis similarity measure
  vars                    list variable names attached to the cluster
                            analysis
  chars                   list any characteristics attached to the cluster
                            analysis
  other                   list any "other" information
* all                     list all items and information attached to the
                            cluster; the default
-------------------------------------------------------------------------
* all is not shown in the dialog box.
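
Example (a minimal sketch, not part of the original entry): the user utilities applied to two analyses. The auto dataset and the names sl, cl, complete, and sl_height are assumptions made for illustration.

```stata
* Load an example dataset and create two analyses (assumed choices)
sysuse auto, clear
cluster singlelinkage price mpg weight, name(sl)
cluster completelinkage price mpg weight, name(cl)

* Directory-style listing, then selected details for one analysis
cluster dir
cluster list sl, type method vars

* Make sl the most recent analysis again
cluster use sl

* Rename an analysis, and rename a variable attached to an analysis
cluster rename cl complete
cluster renamevar sl_hgt sl_height, name(sl)

* Remove the renamed analysis
cluster drop complete
cluster dir
```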
Programmer utilities (see [MV] cluster programming utilities;
help cluster programming)
The query, set, and delete subcommands of cluster provide programmers a
method of obtaining, setting, and deleting the underlying information and
structures of a cluster analysis. The parsedistance subcommand provides
parsing of distance options for programmers. The measures subcommand
computes (dis)similarities.
Obtain various attributes of a cluster analysis
cluster query [clname]
Set various attributes of a cluster analysis
cluster set [clname] [, set_options ]
Delete attributes from a cluster analysis
cluster delete clname [, delete_options ]
Check similarity and dissimilarity measure name
cluster parsedistance measure
Compute similarity and dissimilarity measure
cluster measures varlist [if] [in] , compare(numlist)
generate(newvarlist) [measures_options]
set_options               description
-------------------------------------------------------------------------
  addname                 add clname to the master list of cluster
                            analyses
  type(type)              set the cluster type for clname
  method(method)          set the name of the clustering method for the
                            cluster analysis
  similarity(measure)     set the name of the similarity measure used for
                            the cluster analysis
  dissimilarity(measure)  set the name of the dissimilarity measure used
                            for the cluster analysis
  var(tag varname)        set tag that points to varname
  char(tag charname)      set tag that points to charname
  other(tag text)         set tag with text attached to the tag marker
  note(text)              add a note to clname
-------------------------------------------------------------------------
delete_options            description
-------------------------------------------------------------------------
  zap                     delete all possible settings for clname
  delname                 remove clname from the master list of current
                            cluster analyses
  type                    delete the cluster type entry from clname
  method                  delete the cluster method entry from clname
  dissimilarity           delete the dissimilarity entries from clname
  similarity              delete the similarity entries from clname
  notes(numlist)          delete the specified numbered notes from clname
  allnotes                remove all notes from clname
  var(tag)                remove tag from clname
  allvars                 remove all the entries pointing to variables
                            for clname
  varzap(tag)             same as var(), but also delete the referenced
                            variable
  allvarzap               same as allvars, but also delete the variables
  char(tag)               remove tag that points to a Stata
                            characteristic from clname
  allchars                remove all entries pointing to Stata
                            characteristics for clname
  charzap(tag)            same as char(), but also delete the
                            characteristic
  allcharzap              same as allchars, but also delete the
                            characteristics
  other(tag)              delete tag and its associated text from clname
  allothers               delete all entries from clname that have been
                            set using other()
-------------------------------------------------------------------------
measures_options          description
-------------------------------------------------------------------------
* compare(numlist)        use numlist as the comparison observations
* generate(newvarlist)    generate newvarlist variables
  measure                 (dis)similarity measure; default is L2
  propvars                interpret observations implied by if and in as
                            proportions of binary observations
  propcompares            interpret comparison observations as
                            proportions of binary observations
-------------------------------------------------------------------------
* compare(numlist) and generate(newvarlist) are required.
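
Example (a minimal sketch, not part of the original entry): the read-only programmer utilities. The auto dataset, the analysis name sl, and the variable names d1 and d2 are assumptions made for illustration. cluster set and cluster delete are omitted because they change the stored analysis.

```stata
* Load an example dataset and run an analysis to inspect (assumed choices)
sysuse auto, clear
cluster singlelinkage price mpg weight, name(sl)

* Retrieve the attributes of the analysis; results are left in r()
cluster query sl
return list

* Check a measure name; an alias such as Euclidean is resolved, results in s()
cluster parsedistance Euclidean
sreturn list

* L2 distance from every observation to observations 1 and 2
cluster measures price mpg weight, compare(1 2) generate(d1 d2) L2
list make d1 d2 in 1/5
```

Here d1 is 0 in observation 1 and d2 is 0 in observation 2, because each is the distance from an observation to itself.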
(Dis)similarity measures (see [MV] measure_option;
help measure_option)
Measures are divided into those for continuous data, those for binary
data, and one for a mix of the two. Capitalization of measure names does
not matter. Full definitions are presented in the Continuous measure
definitions and Binary measure definitions sections of
[MV] measure_option.
measure                   description
-------------------------------------------------------------------------
  cont_measure            similarity or dissimilarity measure for
                            continuous data
  binary_measure          similarity measure for binary data
  mixed_measure           dissimilarity measure for a mix of binary and
                            continuous data
-------------------------------------------------------------------------
cont_measure              description
-------------------------------------------------------------------------
  L2                      Euclidean distance (Minkowski with argument 2)
    Euclidean             alias for L2
    L(2)                  alias for L2
  L2squared               squared Euclidean distance
    Lpower(2)             alias for L2squared
  L1                      absolute-value distance (Minkowski with
                            argument 1)
    absolute              alias for L1
    cityblock             alias for L1
    manhattan             alias for L1
    L(1)                  alias for L1
    Lpower(1)             alias for L1
  Linfinity               maximum-value distance (Minkowski with infinite
                            argument)
    maximum               alias for Linfinity
  L(#)                    Minkowski distance with # argument
  Lpower(#)               Minkowski distance with # argument raised to #
                            power
  Canberra                Canberra distance
  correlation             correlation coefficient similarity measure
  angular                 angular separation similarity measure
    angle                 alias for angular
-------------------------------------------------------------------------
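
The Minkowski rows above can be read as formulas. This is a sketch in assumed notation that does not appear in the original entry: x_{ia} is the value of variable a for observation i, p is the number of variables, and m stands for the # argument.

```latex
% L(m): Minkowski distance with argument m
d_{ij}^{\,\mathrm{L}(m)} = \Bigl( \sum_{a=1}^{p} \lvert x_{ia} - x_{ja} \rvert^{m} \Bigr)^{1/m}

% Lpower(m): the same sum without the final root, that is, L(m) raised to the power m
d_{ij}^{\,\mathrm{Lpower}(m)} = \sum_{a=1}^{p} \lvert x_{ia} - x_{ja} \rvert^{m}
```

With m = 2 these give L2 and L2squared. With m = 1 the two coincide, which is why both L(1) and Lpower(1) are aliases for L1.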
binary_measure            description
-------------------------------------------------------------------------
  matching                simple matching similarity coefficient
  Jaccard                 Jaccard binary similarity coefficient
  Russell                 Russell and Rao similarity coefficient
  Hamann                  Hamann similarity coefficient
  Dice                    Dice similarity coefficient
  antiDice                anti-Dice similarity coefficient
  Sneath                  Sneath and Sokal similarity coefficient
  Rogers                  Rogers and Tanimoto similarity coefficient
  Ochiai                  Ochiai similarity coefficient
  Yule                    Yule similarity coefficient
  Anderberg               Anderberg similarity coefficient
  Kulczynski              Kulczynski similarity coefficient
  Pearson                 Pearson's phi similarity coefficient
  Gower2                  similarity coefficient with same denominator as
                            Pearson
-------------------------------------------------------------------------
mixed_measure             description
-------------------------------------------------------------------------
  Gower                   Gower's dissimilarity coefficient
-------------------------------------------------------------------------
Hierarchical clustering of a dissimilarity matrix
(see [MV] cluster linkage; help cluster linkage)
clustermat linkage matname [if] [in] [, clustermat_options ]
linkage                   description
-------------------------------------------------------------------------
singlelinkage             single-linkage cluster analysis
averagelinkage            average-linkage cluster analysis
completelinkage           complete-linkage cluster analysis
waveragelinkage           weighted-average linkage cluster analysis
medianlinkage             median-linkage cluster analysis
centroidlinkage           centroid-linkage cluster analysis
wardslinkage              Ward's linkage cluster analysis
-------------------------------------------------------------------------
clustermat_options        description
-------------------------------------------------------------------------
Main
  shape(shape)            shape (storage method) of matname
  add                     add cluster information to data currently in
                            memory
  clear                   replace data in memory with cluster information
  labelvar(varname)       place dissimilarity matrix row names in varname
  name(clname)            name of resulting cluster analysis

Advanced
  force                   perform clustering after fixing matname problems
  generate(stub)          prefix for generated variables
-------------------------------------------------------------------------
shape                     matname is stored as a
-------------------------------------------------------------------------
  full                    square symmetric matrix; the default
  lower                   vector of rowwise lower triangle (with diagonal)
  llower                  vector of rowwise strict lower triangle (no
                            diagonal)
  upper                   vector of rowwise upper triangle (with diagonal)
  uupper                  vector of rowwise strict upper triangle (no
                            diagonal)
-------------------------------------------------------------------------
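
Example (a minimal sketch, not part of the original entry): clustering a user-supplied dissimilarity matrix. The auto dataset, the matrix name D, and the names dmat and g3 are assumptions made for illustration. The matrix is built here with matrix dissimilarity only to keep the example self-contained; any full symmetric dissimilarity matrix with one row per observation would serve.

```stata
* Load an example dataset (assumed choice)
sysuse auto, clear

* Build a 74 x 74 Euclidean dissimilarity matrix between observations
matrix dissimilarity D = price mpg weight, L2

* Cluster the matrix; add attaches the results to the data in memory,
* which requires as many observations as D has rows
clustermat averagelinkage D, name(dmat) add

* Postclustering commands then work as usual
cluster generate g3 = groups(3), name(dmat)

* The stopping rule needs variables() because only the matrix was clustered
clustermat stop dmat, variables(price mpg weight)
```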
Also see
Manual: [MV] cluster
Help: [MV] clustermat, [MV] cluster programming utilities