Stata 11 help for matrix dissimilarity

Title

[P] matrix dissimilarity -- Compute similarity or dissimilarity measures

Syntax

matrix dissimilarity matname = [varlist] [if] [in] [, options]

    options            description
    -------------------------------------------------------------------------
    measure            similarity or dissimilarity measure; default is L2 (Euclidean)
    observations       compute similarities or dissimilarities between observations; the default
    variables          compute similarities or dissimilarities between variables
    names(varname)     row/column names for matname (allowed with observations)
    allbinary          check that all values are 0, 1, or missing
    proportions        interpret values as proportions of binary values
    dissim(method)     change similarity measure to dissimilarity
    -------------------------------------------------------------------------

where method transforms similarities to dissimilarities by using

    oneminus    d_ij = 1 - s_ij
    standard    d_ij = sqrt(s_ii + s_jj - 2*s_ij)

Description

matrix dissimilarity computes a similarity, dissimilarity, or distance matrix.

Options

measure specifies one of the similarity or dissimilarity measures allowed by Stata. The default is L2, Euclidean distance. Many similarity and dissimilarity measures are provided for continuous data and for binary data; see [MV] measure_option.

observations and variables specify whether similarities or dissimilarities are computed between observations or variables. The default is observations.

names(varname) provides row and column names for matname. varname must be a string variable with a length of 32 or less. You will want to pick a varname that yields unique values for the row and column names. Uniqueness of values is not checked by matrix dissimilarity. names() is not allowed with the variables option. The default row and column names when the similarities or dissimilarities are computed between observations is obs#, where # is the observation number corresponding to that row or column.
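
As a minimal sketch of names(), assuming the labtech dataset used in the Examples section: rowlab is a hypothetical string variable created here only for illustration, and the in 1/5 restriction is added only to keep the listing short. Building the labels from _n guarantees that they are unique, which matrix dissimilarity itself does not check.

```stata
* Sketch only: rowlab is a hypothetical variable, not part of the dataset.
webuse labtech, clear

* Build a short string label per observation; _n makes each label unique
generate str10 rowlab = "lab" + string(_n)

* Use the labels as the row and column names of De
matrix dissimilarity De = x1 x2 x3 in 1/5, names(rowlab)
matrix list De
```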

allbinary checks that all values are 0, 1, or missing. Stata treats nonzero values as one (excluding missing values) when dealing with what are supposed to be binary data (including binary similarity measures). allbinary causes matrix dissimilarity to exit with an error message if the values are not truly binary. allbinary is not allowed with proportions or the Gower measure.

proportions is for use with binary similarity measures. It specifies that values be interpreted as proportions of binary values. The default action treats all nonzero values as one (excluding missing values). With proportions, the values are confirmed to be between zero and one, inclusive. See [MV] measure_option for a discussion of the use of proportions with binary measures. proportions is not allowed with allbinary or the Gower measure.

dissim(method) specifies that similarity measures be transformed into dissimilarity measures. method may be oneminus or standard. oneminus transforms similarities to dissimilarities by using d_ij = 1-s_ij (Kaufman and Rousseeuw 1990, 21). standard uses d_ij = sqrt(s_ii+s_jj-2*s_ij) (Mardia, Kent, and Bibby 1979, 402). dissim() does nothing when the measure is already a dissimilarity or distance. See [MV] measure_option to see which measures are similarities.
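
The oneminus transformation can be checked by hand. The sketch below assumes the homework dataset and the in -5/L range from the Examples section; S, D, and Check are illustrative matrix names. It computes the matching similarity, requests the same measure with dissim(oneminus), and compares the result with 1 - s_ij formed directly.

```stata
* Sketch only: S, D, and Check are illustrative matrix names.
webuse homework, clear

* The matching coefficient is a similarity measure
matrix dissimilarity S = a1-a5 in -5/L, matching

* Same measure, transformed with d_ij = 1 - s_ij
matrix dissimilarity D = a1-a5 in -5/L, matching dissim(oneminus)

* Form 1 - s_ij by hand and report the relative difference from D
matrix Check = J(rowsof(S), colsof(S), 1) - S
display mreldif(D, Check)
```

If the transformation behaves as documented, the displayed relative difference should be zero or negligibly small.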

Remarks

The similarity or dissimilarity between each observation (or variable if the variables option is specified) and the others is placed in matname. The element in the ith row and jth column gives either the similarity or dissimilarity between the ith and jth observation (or variable). Whether you get a similarity or a dissimilarity depends upon the requested measure; see [MV] measure_option.
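
A short sketch of reading individual entries back from matname, again assuming the labtech dataset from the Examples section and restricting to the first five observations so the matrix stays small. De is an illustrative name.

```stata
* Sketch only: De is an illustrative matrix name.
webuse labtech, clear
matrix dissimilarity De = x1 x2 x3 in 1/5

* Entry in row 2, column 3: L2 distance between the 2nd and 3rd observations used
display De[2,3]

* The transposed entry, for comparison
display De[3,2]

* issymmetric() returns 1 when the matrix is symmetric and 0 otherwise
display issymmetric(De)
```

For a symmetric measure such as L2, the two displayed entries are expected to agree.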

If there are many observations (variables when the variables option is specified), you may need to increase the maximum matrix size; see [R] matsize. If the number of observations (or variables) is so large that storing the results in a matrix is not practical, you may wish to consider using the cluster measures command, which stores similarities or dissimilarities in variables; see [MV] cluster programming utilities.
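
Before building a large matrix, it can help to compare the number of observations with the current matrix-size limit. The following is a sketch under the assumption that your flavor of Stata allows the limit to be raised; the value 800 is arbitrary, and the permitted range is described in [R] matsize.

```stata
* Number of observations in memory; this is the dimension of matname
* when all observations are used
count

* Current maximum matrix size
display c(matsize)

* Raise the limit if needed; 800 is only an illustrative value
set matsize 800
```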

When computing similarities or dissimilarities between observations, the default row and column names of matname are set to obs#, where # is the observation number. The names() option allows you to override this default. For similarities or dissimilarities between variables, the row and column names of matname are set to the appropriate variable names.

The order of the rows and columns corresponds with the order of your observations when you are computing similarities or dissimilarities between observations. Warning: If you reorder your data (e.g., using sort or gsort) after running matrix dissimilarity, the row and column ordering will no longer match your data.
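
One way to guard against this, sketched here with the labtech dataset from the Examples section, is to record the observation order before computing the matrix. origorder is a hypothetical variable name introduced for illustration.

```stata
* Sketch only: origorder is a hypothetical variable name.
webuse labtech, clear

* Record the observation order that the rows and columns of De will follow
generate long origorder = _n
matrix dissimilarity De = x1 x2 x3

* Any later reordering breaks the correspondence with De ...
gsort -x1

* ... and sorting on the saved order restores it
sort origorder
```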

Examples

---------------------------------------------------------------------------
Setup
    . webuse labtech

Create matrix De holding the Euclidean distance between all the observations for variables x1, x2, and x3
    . matrix dissimilarity De = x1 x2 x3

List the result
    . mat list De

Create matrix Dc holding the Canberra distance between all the observations for variables x1, x2, and x3
    . matrix dis Dc = x1 x2 x3, Canberra

List the result
    . mat list Dc

Create matrix Dcvars holding the Canberra distance between all the variables
    . mat dis Dcvars = , Canberra variables

List the result
    . mat list Dcvars

---------------------------------------------------------------------------
Setup
    . webuse homework

Create matrix M holding the matching coefficient similarity measure between the last five observations for variables a1 through a5
    . mat dis M = a1-a5 in -5/L, matching

List the result
    . mat list M

Drop matrix M
    . mat drop M

Same as above matrix dissimilarity command, but also verify that the data are binary
    . mat dis M = a1-a5 in -5/L, matching allbinary
---------------------------------------------------------------------------

References

Kaufman, L., and P. J. Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.

Mardia, K. V., J. T. Kent, and J. M. Bibby. 1979. Multivariate Analysis. New York: Academic Press.

Also see

Manual: [P] matrix dissimilarity

Help: [MV] measure_option, [P] matrix; [MV] cluster programming utilities, [MV] cluster, [MV] clustermat, [MV] mdsmat, [MV] parse_dissim


© Copyright 1996–2009 StataCorp LP