Stata 15 help for cldis

[MV] measure_option -- Option for similarity and dissimilarity measures

Syntax

command ..., ... measure(measure) ...

or

command ..., ... measure ...

    measure              Description
    -------------------------------------------------------------------------
    cont_measure         similarity or dissimilarity measure for continuous
                           data
    binary_measure       similarity measure for binary data
    mixed_measure        dissimilarity measure for a mix of binary and
                           continuous data
    -------------------------------------------------------------------------

    cont_measure         Description
    -------------------------------------------------------------------------
    L2                   Euclidean distance (Minkowski with argument 2)
    Euclidean            alias for L2
    L(2)                 alias for L2
    L2squared            squared Euclidean distance
    Lpower(2)            alias for L2squared
    L1                   absolute-value distance (Minkowski with argument 1)
    absolute             alias for L1
    cityblock            alias for L1
    manhattan            alias for L1
    L(1)                 alias for L1
    Lpower(1)            alias for L1
    Linfinity            maximum-value distance (Minkowski with infinite
                           argument)
    maximum              alias for Linfinity
    L(#)                 Minkowski distance with # argument
    Lpower(#)            Minkowski distance with # argument raised to # power
    Canberra             Canberra distance
    correlation          correlation coefficient similarity measure
    angular              angular separation similarity measure
    angle                alias for angular
    -------------------------------------------------------------------------

    binary_measure       Description
    -------------------------------------------------------------------------
    matching             simple matching similarity coefficient
    Jaccard              Jaccard binary similarity coefficient
    Russell              Russell and Rao similarity coefficient
    Hamann               Hamann similarity coefficient
    Dice                 Dice similarity coefficient
    antiDice             anti-Dice similarity coefficient
    Sneath               Sneath and Sokal similarity coefficient
    Rogers               Rogers and Tanimoto similarity coefficient
    Ochiai               Ochiai similarity coefficient
    Yule                 Yule similarity coefficient
    Anderberg            Anderberg similarity coefficient
    Kulczynski           Kulczyński similarity coefficient
    Pearson              Pearson's phi similarity coefficient
    Gower2               similarity coefficient with same denominator as
                           Pearson
    -------------------------------------------------------------------------

    mixed_measure        Description
    -------------------------------------------------------------------------
    Gower                Gower's dissimilarity coefficient
    -------------------------------------------------------------------------

Description

Several commands have options that allow you to specify a similarity or dissimilarity measure designated as measure in the syntax; see [MV] cluster, [MV] mds, [MV] discrim knn, and [MV] matrix dissimilarity. These options are documented here. Most analysis commands (for example, cluster and mds) transform similarity measures to dissimilarity measures as needed.

Options

Measures are divided into those for continuous data, those for binary data, and one for a mix of binary and continuous data. measure is not case sensitive. Full definitions are presented in Similarity and dissimilarity measures for continuous data, Similarity measures for binary data, and Dissimilarity measure for mixed data.

The similarity or dissimilarity measure is most often used to determine the similarity or dissimilarity between observations. However, sometimes the similarity or dissimilarity between variables is of interest.
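
For example, with the auto dataset shipped with Stata, a measure can be supplied either through the measure() option of an analysis command or directly as an option of matrix dissimilarity. The following is a minimal sketch; the matrix name D and the cluster name cl1 are arbitrary choices, not defaults.

. sysuse auto, clear
. * measure() option syntax: absolute-value (L1) distance
. cluster singlelinkage price mpg weight, measure(L1) name(cl1)
. * direct option syntax: the same measure requested by name
. matrix dissimilarity D = price mpg weight, L1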

Similarity and dissimilarity measures for continuous data

Here are the similarity and dissimilarity measures for continuous data available in Stata. In the following formulas, p represents the number of variables, N is the number of observations, and x_iv denotes the value of observation i for variable v. In the formulas comparing observations i and j, the sums run over the variables, indexed by a (and b). See [MV] measure_option for the formulas for the similarity and dissimilarity measures between variables (not presented here).

L2 (aliases Euclidean and L(2)) requests the Minkowski distance metric with argument 2

sqrt(sum((x_ia - x_ja)^2))

L2 is best known as Euclidean distance and is the default dissimilarity measure for discrim knn, mds, matrix dissimilarity, and all the cluster subcommands except for centroidlinkage, medianlinkage, and wardslinkage, which default to using L2squared; see [MV] discrim knn, [MV] mds, [MV] matrix dissimilarity, and [MV] cluster.

L2squared (alias Lpower(2)) requests the square of the Minkowski distance metric with argument 2

sum((x_ia - x_ja)^2)

L2squared is best known as squared Euclidean distance and is the default dissimilarity measure for the centroidlinkage, medianlinkage, and wardslinkage subcommands of cluster; see [MV] cluster.

L1 (aliases absolute, cityblock, manhattan, and L(1)) requests the Minkowski distance metric with argument 1

sum(|x_ia - x_ja|)

which is best known as absolute-value distance.

Linfinity (alias maximum) requests the Minkowski distance metric with infinite argument

max(|x_ia - x_ja|)

and is best known as maximum-value distance.

L(#) requests the Minkowski distance metric with argument #:

{sum(|x_ia - x_ja|^#)}^(1/#)    # >= 1

We discourage using extremely large values for #. Because the absolute value of the difference is being raised to the value of #, depending on the nature of your data, you could experience numeric overflow or underflow. With a large value of #, the L() option will produce results similar to those of the Linfinity option. Use the numerically more stable Linfinity option instead of a large value for # in the L() option.
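
As an illustration of this advice, the sketch below (again using the auto dataset; the matrix names are arbitrary) requests a moderate Minkowski argument and the maximum-value distance. With a very large #, the L() result would approach the Linfinity result but be computed less stably.

. sysuse auto, clear
. matrix dissimilarity D3   = price mpg weight, L(3)
. matrix dissimilarity Dmax = price mpg weight, Linfinity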

See Anderberg (1973) for a discussion of the Minkowski metric and its special cases.

Lpower(#) requests the Minkowski distance metric with argument #, raised to the # power:

sum(|x_ia - x_ja|^#) # >= 1

As with L(#), we discourage using extremely large values for #; see the discussion above.

Canberra requests the following distance metric

sum(|x_ia - x_ja|/(|x_ia|+|x_ja|))

which ranges from 0 to p, the number of variables. Gordon (1999) explains that the Canberra distance is sensitive to small changes near zero.

correlation requests the correlation coefficient similarity measure,

sum((x_ia - xbar_i.)(x_ja - xbar_j.)) / sqrt(sum((x_ia - xbar_i.)^2) * sum((x_jb - xbar_j.)^2))

where xbar_i. = sum(x_ia)/p.

The correlation similarity measure takes values between -1 and 1. With this measure, the relative direction of the two vectors is important. The correlation similarity measure is related to the angular separation similarity measure (described next). The correlation similarity measure gives the cosine of the angle between the two vectors measured from the mean; see Gordon (1999).

angular (alias angle) requests the angular separation similarity measure

sum(x_ia * x_ja)/sqrt(sum(x_ia^2) * sum(x_jb^2))

which is the cosine of the angle between the two vectors measured from zero and takes values from -1 to 1; see Gordon (1999).
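
To make the two cosine-type formulas concrete, here is a minimal Mata sketch (Mata is Stata's matrix language) that evaluates the correlation and angular separation similarities for a pair of made-up observation vectors measured on p = 4 variables.

. mata
:     x = (2, 4, 6, 8)                 // observation i across p = 4 variables
:     y = (1, 3, 5, 9)                 // observation j
:     xc = x :- mean(x')               // center each vector at its own mean
:     yc = y :- mean(y')
:     sum(xc :* yc) / sqrt(sum(xc:^2) * sum(yc:^2))   // correlation similarity
:     sum(x :* y)  / sqrt(sum(x:^2)  * sum(y:^2))     // angular separation
: end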

Similarity measures for binary data

Similarity measures for binary data are based on the four values from the cross-tabulation of observations i and j (when comparing observations) or variables u and v (when comparing variables).

For comparing observations i and j, the cross-tabulation is

                         obs. j
                        1       0
                     +-------------
          obs.   1   |  a       b
            i    0   |  c       d

a is the number of variables where observations i and j both had ones, and d is the number of variables where observations i and j both had zeros. The number of variables where observation i is one and observation j is zero is b, and the number of variables where observation i is zero and observation j is one is c.

See [MV] measure_option for a similar table for comparisons between variables.

Stata treats nonzero values as one when a binary value is expected. Specifying one of the binary similarity measures imposes this behavior unless some other option overrides it (for instance, the allbinary option of [MV] matrix dissimilarity). See [MV] measure_option for a discussion of binary similarity measures applied to averages.

The following binary similarity coefficients are available. Unless stated otherwise, the similarity measures range from 0 to 1.
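
As a worked sketch of how a, b, c, and d arise, consider two made-up binary observation vectors in Mata; the counts are then plugged into a few of the coefficients defined below, including the identity that the Hamann coefficient equals twice the simple matching coefficient minus 1.

. mata
:     x = (1, 1, 0, 1, 0, 0, 1, 0, 1, 1)   // observation i over 10 binary variables
:     y = (1, 0, 0, 1, 1, 0, 1, 0, 0, 1)   // observation j
:     a = sum(x :* y)                       // both are 1:        a = 4
:     b = sum(x :* (1 :- y))                // i is 1, j is 0:    b = 2
:     c = sum((1 :- x) :* y)                // i is 0, j is 1:    c = 1
:     d = sum((1 :- x) :* (1 :- y))         // both are 0:        d = 3
:     (a + d) / (a + b + c + d)             // matching: 7/10 = .7
:     ((a + d) - (b + c)) / (a + b + c + d) // Hamann:   2(.7) - 1 = .4
:     a / (a + b + c)                       // Jaccard:  4/7
: end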

matching requests the simple matching (Zubin 1938; Sokal and Michener 1958) binary similarity coefficient

(a+d)/(a+b+c+d)

which is the proportion of matches between the two observations or variables.

Jaccard requests the Jaccard (1901, 1908) binary similarity coefficient

a/(a+b+c)

which is the proportion of matches when at least one of the vectors had a one. If both vectors are all zeros, this measure is undefined. Stata then declares the answer to be one, meaning perfect agreement. This is a reasonable choice for most applications and will cause an all-zero vector to have similarity of one only with another all-zero vector. In all other cases, an all-zero vector will have Jaccard similarity of zero to the other vector.

The Jaccard coefficient was discovered earlier by Gilbert (1884).

Russell requests the Russell and Rao (1940) binary similarity coefficient

a/(a+b+c+d)

Hamann requests the Hamann (1961) binary similarity coefficient

((a+d)-(b+c))/(a+b+c+d)

which is the number of agreements minus disagreements divided by the total. The Hamann coefficient ranges from -1, perfect disagreement, to 1, perfect agreement. The Hamann coefficient is equal to twice the simple matching coefficient minus 1.

Dice requests the Dice binary similarity coefficient

2a/(2a+b+c)

suggested by Czekanowski (1932), Dice (1945), and Sørensen (1948). The Dice coefficient is similar to the Jaccard similarity coefficient but gives twice the weight to agreements. Like the Jaccard coefficient, the Dice coefficient is declared by Stata to be one if both vectors are all zero, thus avoiding the case where the formula is undefined.

antiDice requests the binary similarity coefficient

a/(a+2(b+c))

which is credited to Anderberg (1973) but was shown earlier by Sokal and Sneath (1963, 129). We did not call this the Anderberg coefficient because there is another coefficient better known by that name; see the Anderberg option. The name antiDice is our creation. This coefficient takes the opposite view from the Dice coefficient and gives double weight to disagreements. As with the Jaccard and Dice coefficients, the anti-Dice coefficient is declared to be one if both vectors are all zeros.

Sneath requests the Sneath and Sokal (1962) binary similarity coefficient

2(a+d)/{2(a+d)+(b+c)}

which is similar to the simple matching coefficient but gives double weight to matches. Also compare the Sneath and Sokal coefficient with the Dice coefficient, which differs only in whether it includes d.

Rogers requests the Rogers and Tanimoto (1960) binary similarity coefficient

(a+d)/{(a+d)+2(b+c)}

which takes the opposite approach from the Sneath and Sokal coefficient and gives double weight to disagreements. Also compare the Rogers and Tanimoto coefficient with the anti-Dice coefficient, which differs only in whether it includes d.

Ochiai requests the Ochiai (1957) binary similarity coefficient

a/sqrt((a+b)(a+c))

The formula for the Ochiai coefficient is undefined when one or both of the vectors being compared are all zeros. If both are all zeros, Stata declares the measure to be one, and if only one of the two vectors is all zeros, the measure is declared to be zero.

The Ochiai coefficient was presented earlier by Driver and Kroeber (1932).

Yule requests the Yule (see Yule [1900] and Yule and Kendall [1950]) binary similarity coefficient

(ad-bc)/(ad+bc)

which ranges from -1 to 1. The formula for the Yule coefficient is undefined when one or both of the vectors are either all zeros or all ones. Stata declares the measure to be 1 when b+c = 0, meaning that there is complete agreement. Stata declares the measure to be -1 when a+d = 0, meaning that there is complete disagreement. Otherwise, if ad-bc = 0, Stata declares the measure to be 0. These rules, applied before using the Yule formula, avoid the cases where the formula would produce an undefined result.
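
The order of these rules can be written out directly. The following Mata function, suitable for a do-file, is only a sketch (the name yule_sim is not part of Stata); it applies the rules before falling back to the formula.

mata:
real scalar yule_sim(real scalar a, real scalar b,
                     real scalar c, real scalar d)
{
    if (b + c == 0)     return(1)    // complete agreement is declared 1
    if (a + d == 0)     return(-1)   // complete disagreement is declared -1
    if (a*d - b*c == 0) return(0)
    return((a*d - b*c) / (a*d + b*c))
}
end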

Anderberg requests the Anderberg binary similarity coefficient

(a/(a+b) + a/(a+c) + d/(c+d) + d/(b+d))/4

The Anderberg coefficient is undefined when one or both vectors are either all zeros or all ones. This difficulty is overcome by first applying the rule that if both vectors are all ones (or both vectors are all zeros), the similarity measure is declared to be one. Otherwise, if any of the marginal totals (a+b, a+c, c+d, b+d) are zero, then the similarity measure is declared to be zero.

Though this similarity coefficient is best known as the Anderberg coefficient, it appeared earlier in Sokal and Sneath (1963, 130).

Kulczynski requests the Kulczyński (1927) binary similarity coefficient

(a/(a+b) + a/(a+c))/2

The formula for this measure is undefined when one or both of the vectors are all zeros. If both vectors are all zeros, Stata declares the similarity measure to be one. If only one of the vectors is all zeros, the similarity measure is declared to be zero.

Pearson requests Pearson's (1900) phi binary similarity coefficient

(ad-bc)/sqrt((a+b)(a+c)(d+b)(d+c))

which ranges from -1 to 1. The formula for this coefficient is undefined when one or both of the vectors are either all zeros or all ones. Stata declares the measure to be 1 when b+c = 0, meaning that there is complete agreement. Stata declares the measure to be -1 when a+d = 0, meaning that there is complete disagreement. Otherwise, if ad-bc = 0, Stata declares the measure to be 0. These rules, applied before using Pearson's phi coefficient formula, avoid the cases where the formula would produce an undefined result.

Gower2 requests the binary similarity coefficient

ad/sqrt((a+b)(a+c)(d+b)(d+c))

which is presented by Gower (1985) but appeared earlier in Sokal and Sneath (1963, 130). Stata uses the name Gower2 to avoid confusion with the better-known Gower coefficient, which is used with a mix of binary and continuous data.

The formula for this similarity measure is undefined when one or both of the vectors are all zeros or all ones. This is overcome by first applying the rule that if both vectors are all ones (or both vectors are all zeros) then the similarity measure is declared to be one. Otherwise, if ad = 0, the similarity measure is declared to be zero.

Dissimilarity measure for mixed data

Here is one measure that works with a mix of binary and continuous data. Binary variables are those containing only zeros, ones, and missing values; all other variables are continuous. The formulas below are for the dissimilarity between observations; see [MV] measure_option for the formulas for the dissimilarity between variables (not presented here).

Gower requests the Gower (1971) dissimilarity coefficient for a mix of binary and continuous variables

sum(delta_ijv*d_ijv)/sum(delta_ijv)

where delta_ijv is a binary indicator equal to 1 whenever both observations i and j are nonmissing for variable v, and zero otherwise. Observations with missing values are not included when using cluster or mds, and so if an observation is included, delta_ijv = 1 and sum(delta_ijv) is the number of variables. However, using matrix dissimilarity with the Gower option does not exclude observations with missing values. See [MV] cluster, [MV] mds, and [MV] matrix dissimilarity.

For binary variables v,

d_ijv = 0    if x_iv = x_jv
      = 1    otherwise

This is the same as the matching measure.

For continuous variables v,

d_ijv = |x_iv - x_jv|/(max_k(x_kv)-min_k(x_kv))

d_ijv is set to 0 if (max_k(x_kv)-min_k(x_kv)) is zero, that is, if the range of the variable is zero. This is the L1 measure divided by the range of the variable.

The Gower measure interprets binary variables as those with only 0, 1, or missing values. All other variables are treated as continuous.

In [MV] matrix dissimilarity, missing observations are included only in the calculation of the Gower dissimilarity, but the formula for this dissimilarity measure is undefined when all the values of delta_ijv or delta_iuv are zero. The dissimilarity is then set to missing.
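
As a small worked sketch of the formula, consider two made-up observations on one binary and two continuous variables, with the continuous differences scaled by hypothetical sample ranges; the Mata names below are arbitrary.

. mata
:     // observation i: (1, 10, 5);  observation j: (0, 14, .)
:     // hypothetical sample ranges of the two continuous variables: 20 and 8
:     d1 = 1                      // binary part: x_i1 != x_j1
:     d2 = abs(10 - 14) / 20      // continuous part: L1 divided by the range
:     // variable 3 is missing for observation j, so delta_ij3 = 0 and it drops out
:     (d1 + d2) / 2               // Gower dissimilarity = .6
: end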

Technical note

Normally the commands

. matrix dissimilarity gm = x1 x2 y1, Gower
. clustermat waverage gm, add

and

. cluster waverage x1 x2 y1, measure(Gower)

will yield the same results, and likewise with mdsmat and mds. However, if any of the variables contain missing observations, this will not be the case. cluster and mds exclude all observations that have missing values for any of the variables of interest, whereas matrix dissimilarity with the Gower option does not. See [MV] cluster, [MV] mds, and [MV] matrix dissimilarity for more information.

Note: matrix dissimilarity without the Gower option does exclude all observations that have missing values for any of the variables of interest.
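
A concrete version of this note, sketched with the auto dataset: rep78 is missing for a few observations, so the matrix-based and direct routes below can legitimately disagree (the matrix and cluster names are arbitrary).

. sysuse auto, clear
. * matrix dissimilarity keeps the observations with missing rep78 ...
. matrix dissimilarity gm = price mpg foreign rep78, Gower
. clustermat waverage gm, add name(fromMat)
. * ... whereas cluster drops them before computing the Gower dissimilarities
. cluster waverage price mpg foreign rep78, measure(Gower) name(direct)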

References

Anderberg, M. R. 1973. Cluster Analysis for Applications. New York: Academic Press.

Czekanowski, J. 1932. "Coefficient of racial likeness" und "durchschnittliche Differenz". Anthropologischer Anzeiger 9: 227-249.

Dice, L. R. 1945. Measures of the amount of ecologic association between species. Ecology 26: 297-302.

Driver, H. E., and A. L. Kroeber. 1932. Quantitative expression of cultural relationships. University of California Publications in American Archaeology and Ethnology 31: 211-256.

Gilbert, G. K. 1884. Finley's tornado predictions. American Meteorological Journal 1: 166-172.

Gordon, A. D. 1999. Classification. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC.

Gower, J. C. 1971. A general coefficient of similarity and some of its properties. Biometrics 27: 857-871.

------. 1985. Measures of similarity, dissimilarity, and distance. In Vol. 5 of Encyclopedia of Statistical Sciences, ed. S. Kotz, N. L. Johnson, and C. B. Read, 397-405. New York: Wiley.

Hamann, U. 1961. Merkmalsbestand und Verwandtschaftsbeziehungen der Farinosae. Ein Beitrag zum System der Monokotyledonen. Willdenowia 2: 639-768.

Jaccard, P. 1901. Distribution de la flore alpine dans le Bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles 37: 241-272.

------. 1908. Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des Sciences Naturelles 44: 223-270.

Kulczyński, S. 1927. Die Pflanzenassoziationen der Pieninen [In Polish, German summary]. Bulletin International de l'Academie Polonaise des Sciences et des Lettres, Classe des Sciences Mathematiques et Naturelles, B (Sciences Naturelles) Suppl. II: 57-203.

Ochiai, A. 1957. Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions [in Japanese, English summary]. Bulletin of the Japanese Society of Scientific Fisheries 22: 526-530.

Pearson, K. 1900. Mathematical contributions to the theory of evolution -- VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society of London, Series A 195: 1-47.

Rogers, D. J., and T. T. Tanimoto. 1960. A computer program for classifying plants. Science 132: 1115-1118.

Russell, P. F., and T. R. Rao. 1940. On habitat and association of species of anopheline larvae in south-eastern Madras. Journal of the Malaria Institute of India 3: 153-178.

Sneath, P. H. A., and R. R. Sokal. 1962. Numerical taxonomy. Nature 193: 855-860.

Sokal, R. R., and C. D. Michener. 1958. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 28: 1409-1438.

Sokal, R. R., and P. H. A. Sneath. 1963. Principles of Numerical Taxonomy. San Francisco: Freeman.

Sørensen, T. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Royal Danish Academy of Sciences and Letters, Biological Series 5: 1-34.

Yule, G. U. 1900. On the association of attributes in statistics: With illustrations from the material of the Childhood Society, etc. Philosophical Transactions of the Royal Society, Series A 194: 257-319.

Yule, G. U., and M. G. Kendall. 1950. An Introduction to the Theory of Statistics. 14th ed. New York: Hafner.

Zubin, J. 1938. A technique for measuring like-mindedness. Journal of Abnormal and Social Psychology 33: 508-516.

