help matrix dissimilarity
-------------------------------------------------------------------------------
Title
[P] matrix dissimilarity -- Compute similarity or dissimilarity measures
Syntax
matrix dissimilarity matname = [varlist] [if] [in] [, options]
options            description
-------------------------------------------------------------------------
measure            similarity or dissimilarity measure; default is L2
                   (Euclidean)
observations       compute similarities or dissimilarities between
                   observations; the default
variables          compute similarities or dissimilarities between
                   variables
names(varname)     row/column names for matname (allowed only with
                   observations)
allbinary          check that all values are 0, 1, or missing
proportions        interpret values as proportions of binary values
dissim(method)     change similarity measure to dissimilarity measure
-------------------------------------------------------------------------
where method transforms similarities to dissimilarities by using
oneminus           d_ij = 1 - s_ij
standard           d_ij = sqrt(s_ii + s_jj - 2*s_ij)
Description
matrix dissimilarity computes a similarity, dissimilarity, or distance
matrix.
Options
measure specifies one of the similarity or dissimilarity measures allowed
by Stata. The default is L2, Euclidean distance. Many similarity
and dissimilarity measures are provided for continuous data and for
binary data; see [MV] measure_option.
observations and variables specify whether similarities or
dissimilarities are computed between observations or variables. The
default is observations.
names(varname) provides row and column names for matname. varname must
be a string variable with a length of 32 or less. Choose a varname
whose values are unique, because matrix dissimilarity does not check
the row and column names for uniqueness.
names() is not allowed with the variables option. The default row
and column names when the similarities or dissimilarities are
computed between observations is obs#, where # is the observation
number corresponding to that row or column.
allbinary checks that all values are 0, 1, or missing. When data are
supposed to be binary (for example, with binary similarity measures),
Stata otherwise treats every nonzero, nonmissing value as one.
allbinary causes matrix dissimilarity to exit with an error message if
the values are not truly binary. allbinary is not allowed with
proportions or the Gower measure.
proportions is for use with binary similarity measures. It specifies
that values be interpreted as proportions of binary values. The
default action treats all nonzero values as one (excluding missing
values). With proportions, the values are confirmed to be between
zero and one, inclusive. See [MV] measure_option for a discussion of
the use of proportions with binary measures. proportions is not
allowed with allbinary or the Gower measure.
dissim(method) specifies that similarity measures be transformed into
dissimilarity measures. method may be oneminus or standard.
oneminus transforms similarities to dissimilarities by using d_ij =
1-s_ij (Kaufman and Rousseeuw 1990, 21). standard uses d_ij =
sqrt(s_ii+s_jj-2*s_ij) (Mardia, Kent, and Bibby 1979, 402). dissim()
does nothing when the measure is already a dissimilarity or distance.
See [MV] measure_option for a list of which measures are similarities.
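
As a brief sketch of dissim(), the commands below compute the matching
coefficient on the last five observations of the homework dataset used in
the Examples section, once as a similarity and once transformed with
oneminus. The matrix names S and D are arbitrary and chosen only for
illustration.

```stata
* Assumes the homework dataset and variables a1-a5 from the Examples section
webuse homework, clear
* Matching coefficient stored as a similarity
matrix dissimilarity S = a1-a5 in -5/L, matching
* Same measure transformed by d_ij = 1 - s_ij
matrix dissimilarity D = a1-a5 in -5/L, matching dissim(oneminus)
matrix list S
matrix list D
```

Each element of D should equal one minus the corresponding element of S.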
Remarks
The similarity or dissimilarity between each observation (or variable if
the variables option is specified) and the others is placed in matname.
The element in the ith row and jth column gives either the similarity or
dissimilarity between the ith and jth observation (or variable). Whether
you get a similarity or a dissimilarity depends upon the requested
measure; see [MV] measure_option.
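
To make the indexing concrete, here is a small sketch that assumes the
labtech dataset and the variables x1, x2, and x3 from the Examples section,
and that the first two observations have no missing values. It reads single
elements of De and recomputes the Euclidean distance between the first two
observations by hand for comparison.

```stata
webuse labtech, clear
matrix dissimilarity De = x1 x2 x3
* Element in row 2, column 1: distance between observations 2 and 1
display De[2,1]
* The transposed element refers to the same pair of observations
display De[1,2]
* Hand computation of the L2 (Euclidean) distance for the same pair
display sqrt((x1[1]-x1[2])^2 + (x2[1]-x2[2])^2 + (x3[1]-x3[2])^2)
```

The three displayed values should agree up to rounding.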
If there are many observations (or many variables, when the variables
option is specified), you may need to increase the maximum matrix size; see [R]
matsize. If the number of observations (or variables) is so large that
storing the results in a matrix is not practical, you may wish to
consider using the cluster measures command, which stores similarities or
dissimilarities in variables; see [MV] cluster programming utilities.
When computing similarities or dissimilarities between observations, the
default row and column names of matname are set to obs#, where # is the
observation number. The names() option allows you to override this
default. For similarities or dissimilarities between variables, the row
and column names of matname are set to the appropriate variable names.
The order of the rows and columns corresponds with the order of your
observations when you are computing similarities or dissimilarities
between observations. Warning: If you reorder your data (e.g., using
sort or gsort) after running matrix dissimilarity, the row and column
ordering will no longer match your data.
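
One way to guard against this, sketched here with a hypothetical helper
variable named origorder, is to record the observation order before
computing the matrix and to restore it before relating rows of the matrix
back to the data.

```stata
webuse labtech, clear
* origorder is a hypothetical name; it stores the order used for the matrix
generate long origorder = _n
matrix dissimilarity De = x1 x2 x3
* ... later work that reorders the data ...
gsort -x1
* Restore the original order so that row i of De matches observation i again
sort origorder
```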
Examples
---------------------------------------------------------------------------
Setup
. webuse labtech
Create matrix De holding the Euclidean distance between all the
observations for variables x1, x2, and x3
. matrix dissimilarity De = x1 x2 x3
List the result
. mat list De
Create matrix Dc holding the Canberra distance between all the
observations for variables x1, x2, and x3
. matrix dis Dc = x1 x2 x3, Canberra
List the result
. mat list Dc
Create matrix Dcvars holding the Canberra distance between all the
variables
. mat dis Dcvars = , Canberra variables
List the result
. mat list Dcvars
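
The names() option can be illustrated with the same data. This sketch
assumes a hypothetical string variable id, created here only to label the
rows and columns; the matrix name Dn and the restriction to the first five
observations are likewise illustrative choices.

```stata
webuse labtech, clear
* id is a hypothetical identifier; values should be unique, at most 32 characters
generate str8 id = "lab" + string(_n)
* Label rows and columns of Dn with id instead of the default obs#
matrix dissimilarity Dn = x1 x2 x3 in 1/5, names(id)
matrix list Dn
```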
---------------------------------------------------------------------------
Setup
. webuse homework
Create matrix M holding the matching coefficient similarity measure
between the last five observations for variables a1 through a5
. mat dis M = a1-a5 in -5/L, matching
List the result
. mat list M
Drop matrix M
. mat drop M
Same as the matrix dissimilarity command above, but also verify that the
data are binary
. mat dis M = a1-a5 in -5/L, matching allbinary
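
The proportions option can be sketched in the same way. The tiny dataset
below is made up purely for illustration (the variables p1, p2, and p3 and
their values are assumptions, not part of any shipped dataset), and it
replaces the data in memory. All values lie between zero and one, as
proportions requires.

```stata
clear
* Hypothetical data: each value is a proportion of binary values
input p1 p2 p3
.2  .5  1
0   .25 .75
1   1   0
end
* Matching coefficient with values interpreted as proportions
matrix dissimilarity P = p1 p2 p3, matching proportions
matrix list P
```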
---------------------------------------------------------------------------
References
Kaufman, L., and P. J. Rousseeuw. 1990. Finding Groups in Data: An
Introduction to Cluster Analysis. New York: Wiley.
Mardia, K. V., J. T. Kent, and J. M. Bibby. 1979. Multivariate Analysis.
New York: Academic Press.
Also see
Manual: [P] matrix dissimilarity
Help: [MV] measure_option, [P] matrix; [MV] cluster programming
utilities, [MV] cluster, [MV] clustermat, [MV] mdsmat, [MV]
parse_dissim