**[MV] cluster kmeans and kmedians** -- Kmeans and kmedians cluster analysis

__Syntax__

Kmeans cluster analysis

**cluster** __k__**means** [*varlist*] [*if*] [*in*] **,** **k(***#***)** [ *options* ]

Kmedians cluster analysis

**cluster** __kmed__**ians** [*varlist*] [*if*] [*in*] **,** **k(***#***)** [ *options* ]

*options* Description
-------------------------------------------------------------------------
Main
* **k(***#***)** perform cluster analysis resulting in *#* groups
__mea__**sure(***measure***)** similarity or dissimilarity measure; default is
**L2** (Euclidean)
__n__**ame(***clname***)** name of resulting cluster analysis

Options
__s__**tart(***start_option***)** obtain *k* initial group centers by using
*start_option*
__keep__**centers** append the *k* final group means or medians to the
data

Advanced
__gen__**erate(***groupvar***)** name of grouping variable
__iter__**ate(***#***)** maximum number of iterations; default is
**iterate(10000)**
-------------------------------------------------------------------------
* **k(***#***)** is required.

__Menu__

__cluster kmeans__

**Statistics > Multivariate analysis > Cluster analysis > Cluster data**
**>** **Kmeans**

__cluster kmedians__

**Statistics > Multivariate analysis > Cluster analysis > Cluster data**
**>** **Kmedians**

__Description__

**cluster kmeans** and **cluster kmedians** perform kmeans and kmedians partition
cluster analysis, respectively. See **[MV] cluster** for a listing of the
**cluster** commands.

__Options__

+------+
----+ Main +-------------------------------------------------------------

**k(***#***)** is required and indicates that *#* groups are to be formed by the
cluster analysis.

**measure(***measure***)** specifies the similarity or dissimilarity measure. The
default is **measure(L2)**, Euclidean distance. This option is not case
sensitive. See **[MV]** *measure_option* for detailed descriptions of the
supported measures.

**name(***clname***)** specifies the name to attach to the resulting cluster
analysis. If **name()** is not specified, Stata finds an available
cluster name, displays it for your reference, and attaches the name
to your cluster analysis.

+---------+
----+ Options +----------------------------------------------------------

**start(***start_option***)** indicates how the *k* initial group centers are to be
obtained. The available *start_option*s are

__kr__**andom**[**(***seed#***)**], the default, specifies that *k* unique observations
be chosen at random, from among those to be clustered, as
starting centers for the *k* groups. Optionally, a random-number
seed may be specified to cause the command **set seed** *seed#* (see
**[R] set seed**) to be applied before the *k* random observations are
chosen.

__f__**irstk**[**,** __ex__**clude**] specifies that the first *k* observations from among
those to be clustered be used as the starting centers for the *k*
groups. With the **exclude** option, these first *k* observations are
not included among the observations to be clustered.

__l__**astk**[**,** __ex__**clude**] specifies that the last *k* observations from among
those to be clustered be used as the starting centers for the *k*
groups. With the **exclude** option, these last *k* observations are
not included among the observations to be clustered.

__r__**andom**[**(***seed#***)**] specifies that *k* random initial group centers be
generated. The values are randomly chosen from a uniform
distribution over the range of the data. Optionally, a
random-number seed may be specified to cause the command **set seed**
*seed#* (see **[R] set seed**) to be applied before the *k* group centers
are generated.

__pr__**andom**[**(***seed#***)**] specifies that *k* partitions be formed randomly among
the observations to be clustered. The group means or medians
from the *k* groups defined by this partitioning are to be used as
the starting group centers. Optionally, a random-number seed may
be specified to cause the command **set seed** *seed#* (see **[R] set**
**seed**) to be applied before the *k* partitions are chosen.

__everyk__**th** specifies that *k* partitions be formed by assigning
observations 1, 1+*k*, 1+2*k*, ... to the first group; assigning
observations 2, 2+*k*, 2+2*k*, ... to the second group; and so on, to
form *k* groups. The group means or medians from these *k* groups
are to be used as the starting group centers.

__seg__**ments** specifies that *k* nearly equal partitions be formed from the
data. Approximately the first **N**/*k* observations are assigned to
the first group, the second **N**/*k* observations are assigned to the
second group, and so on. The group means or medians from these *k*
groups are to be used as the starting group centers.

__g__**roup(***varname***)** provides an initial grouping variable, *varname*, that
defines *k* groups among the observations to be clustered. The
group means or medians from these *k* groups are to be used as the
starting group centers.
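
As a rough illustration of the start options described above, the following sketch reuses the labtech data from the Examples section. The seed value 12345, the variable name g0, and the cluster names are arbitrary choices made here for illustration and are not part of the command's documentation.

```stata
* Data assumed: the labtech dataset used in the Examples section
webuse labtech

* krandom with a seed, so the k random starting observations are reproducible
cluster kmeans x1 x2 x3 x4, k(4) start(krandom(12345)) name(k4seed)

* Starting centers taken from every-kth and from segment partitions
cluster kmeans x1 x2 x3 x4, k(4) start(everykth) name(k4every)
cluster kmeans x1 x2 x3 x4, k(4) start(segments) name(k4seg)

* User-supplied initial grouping: g0 (hypothetical name) takes values 1 to 4,
* assigning observations 1, 5, 9, ... to group 1, and so on
generate byte g0 = mod(_n - 1, 4) + 1
cluster kmeans x1 x2 x3 x4, k(4) start(group(g0)) name(k4grp)
```

Because kmeans and kmedians results can depend on the starting centers, comparing the groupings produced under different start options is one way to judge how stable a solution is.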

**keepcenters** specifies that the group means or medians from the *k* groups
that are produced are to be appended to the data.
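
A minimal sketch of **keepcenters**, assuming the labtech data from the Examples section. The cluster name k4c is a hypothetical choice. Comparing the observation count before and after the command is one way to see the appended centers.

```stata
* Data assumed: the labtech dataset used in the Examples section
webuse labtech, clear

* Number of observations before clustering
count

* Four groups; the final group means are appended to the data
cluster kmeans x1 x2 x3 x4, k(4) name(k4c) keepcenters

* The count is expected to be larger by 4, one appended row per group center
count
```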

+----------+
----+ Advanced +---------------------------------------------------------

**generate(***groupvar***)** provides the name of the grouping variable to be
created by **cluster kmeans** or **cluster kmedians**. By default, this will
be the name specified in **name()**.

**iterate(***#***)** specifies the maximum number of iterations to allow in the
kmeans or kmedians clustering algorithm. The default is
**iterate(10000)**.
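
The two advanced options can be combined as in the sketch below, again assuming the labtech data. The names k5 and grp5 and the iteration limit of 50 are illustrative values only.

```stata
* Data assumed: the labtech dataset used in the Examples section
webuse labtech, clear

* Store group membership in grp5 rather than in a variable named k5,
* and stop after at most 50 iterations
cluster kmeans x1 x2 x3 x4, k(5) name(k5) generate(grp5) iterate(50)

* Inspect the sizes of the resulting groups
tabulate grp5
```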

__Examples__

Setup
**. webuse labtech**

Perform kmeans cluster analysis, creating eight groups
**. cluster kmeans x1 x2 x3 x4, k(8)**

Same as above, but using absolute-value distance instead of Euclidean
distance, naming cluster analysis **k8abs**
**. cluster kmeans x1 x2 x3 x4, k(8) measure(L1) name(k8abs)**

Perform kmedians cluster analysis, creating six groups by using the
Canberra distance metric
**. cluster kmedians x1 x2 x3 x4, k(6) measure(Canberra)**

Create six groups, using the first 6 observations in the dataset as
starting centers
**. cluster kmedians x1 x2 x3 x4, k(6) start(firstk)**

Same as above, but do not include the first 6 observations in the cluster
analysis
**. cluster kmedians x1 x2 x3 x4, k(6) start(firstk, exclude)**