**[MV]** *measure_option* -- Option for similarity and dissimilarity measures

__Syntax__

*command* ...**,** ... __mea__**sure(***measure***)** ...

or

*command* ...**,** ... *measure* ...

*measure* Description
-------------------------------------------------------------------------
*cont_measure* similarity or dissimilarity measure for continuous data
*binary_measure* similarity measure for binary data
*mixed_measure* dissimilarity measure for a mix of binary and
continuous data
-------------------------------------------------------------------------

*cont_measure* Description
-------------------------------------------------------------------------
**L2** Euclidean distance (Minkowski with argument 2)
__Euc__**lidean** alias for **L2**
**L(2)** alias for **L2**
**L2squared** squared Euclidean distance
**Lpower(2)** alias for **L2squared**
**L1** absolute-value distance (Minkowski with argument 1)
__abs__**olute** alias for **L1**
__cityb__**lock** alias for **L1**
__manhat__**tan** alias for **L1**
**L(1)** alias for **L1**
**Lpower(1)** alias for **L1**
__Linf__**inity** maximum-value distance (Minkowski with infinite
argument)
__max__**imum** alias for **Linfinity**
**L(***#***)** Minkowski distance with argument *#*
__Lpow__**er(***#***)** Minkowski distance with argument *#*, raised to the *#* power
__Canb__**erra** Canberra distance
__corr__**elation** correlation coefficient similarity measure
__ang__**ular** angular separation similarity measure
__ang__**le** alias for **angular**
-------------------------------------------------------------------------

*binary_measure* Description
-------------------------------------------------------------------------
__match__**ing** simple matching similarity coefficient
__Jac__**card** Jaccard binary similarity coefficient
__Russ__**ell** Russell and Rao similarity coefficient
**Hamann** Hamann similarity coefficient
**Dice** Dice similarity coefficient
**antiDice** anti-Dice similarity coefficient
**Sneath** Sneath and Sokal similarity coefficient
**Rogers** Rogers and Tanimoto similarity coefficient
**Ochiai** Ochiai similarity coefficient
**Yule** Yule similarity coefficient
__Ander__**berg** Anderberg similarity coefficient
__Kulc__**zynski** Kulczyński similarity coefficient
**Pearson** Pearson's phi similarity coefficient
**Gower2** similarity coefficient with same denominator as **Pearson**
-------------------------------------------------------------------------

*mixed_measure* Description
-------------------------------------------------------------------------
**Gower** Gower's dissimilarity coefficient
-------------------------------------------------------------------------

__Description__

Several commands have options that allow you to specify a similarity or
dissimilarity measure designated as *measure* in the syntax; see **[MV]**
**cluster**, **[MV] mds**, **[MV] discrim knn**, and **[MV] matrix dissimilarity**.
These options are documented here. Most analysis commands (for example,
**cluster** and **mds**) transform similarity measures to dissimilarity measures
as needed.

__Options__

Measures are divided into those for continuous data, those for binary
data, and one for a mix of binary and continuous data.
*measure* is not case sensitive. Full definitions are presented in
*Similarity and dissimilarity measures for continuous data*, *Similarity*
*measures for binary data*, and *Dissimilarity measures for mixed data*.

The similarity or dissimilarity measure is most often used to determine
the similarity or dissimilarity between observations. However, sometimes
the similarity or dissimilarity between variables is of interest.

__Similarity and dissimilarity measures for continuous data__

Here are the similarity and dissimilarity measures for continuous data
available in Stata. In the following formulas, p represents the number
of variables, N is the number of observations, and x_ia denotes the value
of observation i for variable a; sums and maxima in the formulas run over
the variables a = 1, ..., p. See **[MV]** *measure_option* for the
formulas for the similarity and dissimilarity measures between variables
(not presented here).

**L2** (aliases **Euclidean** and **L(2)**)
requests the Minkowski distance metric with argument 2

sqrt(sum((x_ia - x_ja)^2))

**L2** is best known as Euclidean distance and is the default
dissimilarity measure for **discrim knn**, **mds**, **matrix dissimilarity**, and
all the **cluster** subcommands except for **centroidlinkage**,
**medianlinkage**, and **wardslinkage**, which default to using **L2squared**;
see **[MV] discrim knn**, **[MV] mds**, **[MV] matrix dissimilarity**, and **[MV]**
**cluster**.

**L2squared** (alias **Lpower(***2***)**)
requests the square of the Minkowski distance metric with argument 2

sum((x_ia - x_ja)^2)

**L2squared** is best known as squared Euclidean distance and is the
default dissimilarity measure for the **centroidlinkage**, **medianlinkage**,
and **wardslinkage** subcommands of **cluster**; see **[MV] cluster**.

**L1** (aliases **absolute**, **cityblock**, **manhattan**, and **L(1)**)
requests the Minkowski distance metric with argument 1

sum(|x_ia - x_ja|)

which is best known as absolute-value distance.

**Linfinity** (alias **maximum**)
requests the Minkowski distance metric with infinite argument

max(|x_ia - x_ja|)

and is best known as maximum-value distance.

**L(***#***)**
requests the Minkowski distance metric with argument *#*:

(sum(|x_ia - x_ja|^*#*))^(1/*#*),   *#* >= 1

We discourage using extremely large values for *#*. Because the
absolute value of the difference is being raised to the value of *#*,
depending on the nature of your data, you could experience numeric
overflow or underflow. With a large value of *#*, the **L()** option will
produce results similar to those of the **Linfinity** option. Use the
numerically more stable **Linfinity** option instead of a large value for
*#* in the **L()** option.

See Anderberg (1973) for a discussion of the Minkowski metric and its
special cases.

**Lpower(***#***)**
requests the Minkowski distance metric with argument *#*, raised to the
*#* power:

sum(|x_ia - x_ja|^*#*),   *#* >= 1

As with **L(***#***)**, we discourage using extremely large values for *#*; see
the discussion above.
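
The Minkowski family described above can be sketched in a few lines of
Python. This is an illustration of the formulas only, not Stata's
implementation; the function names are my own.

```python
# Illustrative sketch of the Minkowski family of dissimilarities.

def minkowski(x, y, p):
    """L(p): Minkowski distance with argument p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def minkowski_power(x, y, p):
    """Lpower(p): Minkowski distance with argument p, raised to the p power."""
    return sum(abs(a - b) ** p for a, b in zip(x, y))

def l_infinity(x, y):
    """Linfinity: maximum-value distance."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [1.0, 2.0, 4.0], [2.0, 2.0, 1.0]
print(minkowski(x, y, 1))        # L1: |1-2| + |2-2| + |4-1| = 4.0
print(minkowski(x, y, 2))        # L2: sqrt(1 + 0 + 9) = sqrt(10)
print(minkowski_power(x, y, 2))  # L2squared: 10.0
print(l_infinity(x, y))          # Linfinity: 3.0
```

Trying a large argument such as `p=200` in `minkowski` makes the
numeric-overflow warning above concrete: `abs(a - b) ** 200` quickly
exceeds floating-point range.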

**Canberra**
requests the following distance metric

sum(|x_ia - x_ja|/(|x_ia|+|x_ja|))

which ranges from 0 to p, the number of variables. Gordon (1999)
explains that the Canberra distance is sensitive to small changes
near zero.
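
A minimal Python sketch of the Canberra distance follows. How terms with
|x_ia| + |x_ja| = 0 should be handled is not stated above; skipping them
(treating them as zero) is my assumption for this illustration.

```python
# Illustrative sketch of the Canberra distance (not Stata code).
# Assumption: terms with a zero denominator contribute zero.

def canberra(x, y):
    total = 0.0
    for u, v in zip(x, y):
        denom = abs(u) + abs(v)
        if denom > 0:                  # skip 0/0 terms by assumption
            total += abs(u - v) / denom
    return total

print(canberra([1.0, -1.0, 0.0], [3.0, 1.0, 0.0]))  # 2/4 + 2/2 + 0 = 1.5
```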

**correlation**
requests the correlation coefficient similarity measure,

        sum((x_ia-xbar_i.)(x_ja-xbar_j.))
---------------------------------------------------
sqrt(sum((x_ia-xbar_i.)^2) * sum((x_jb-xbar_j.)^2))

where xbar_i. = sum(x_ia)/p.

The correlation similarity measure takes values between -1 and 1.
With this measure, the relative direction of the two vectors is
important. The correlation similarity measure is related to the
angular separation similarity measure (described next). The
correlation similarity measure gives the cosine of the angle between
the two vectors measured from the mean; see Gordon (1999).

**angular** (alias **angle**)
requests the angular separation similarity measure

sum(x_ia * x_ja)/sqrt(sum(x_ia^2) * sum(x_jb^2))

which is the cosine of the angle between the two vectors measured
from zero and takes values from -1 to 1; see Gordon (1999).
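
The relationship between the two similarity measures, one measuring the
angle from the mean and the other from zero, can be sketched in Python
(illustrative only; the function names are mine):

```python
# Illustrative sketch of the correlation and angular separation
# similarity measures (not Stata's implementation).
import math

def correlation_sim(x, y):
    """Cosine of the angle between x and y measured from the mean."""
    p = len(x)
    xm, ym = sum(x) / p, sum(y) / p
    num = sum((a - xm) * (b - ym) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xm) ** 2 for a in x) *
                    sum((b - ym) ** 2 for b in y))
    return num / den

def angular_sim(x, y):
    """Cosine of the angle between x and y measured from zero."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]           # y points in the same direction as x
print(angular_sim(x, y))      # 1.0
print(correlation_sim(x, y))  # 1.0
```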

__Similarity measures for binary data__

Similarity measures for binary data are based on the four values from the
cross-tabulation of observation i and j (when comparing observations) or
variables u and v (when comparing variables).

For comparing observations i and j, the cross-tabulation is

                 |   obs. j
                 |   1    0
          -------+----------
          obs. 1 |   a    b
           i   0 |   c    d

a is the number of variables where observations i and j both had ones,
and d is the number of variables where observations i and j both had
zeros. The number of variables where observation i is one and
observation j is zero is b, and the number of variables where observation
i is zero and observation j is one is c.

See **[MV]** *measure_option* to see a similar table for comparison between
variables.

Stata treats nonzero values as one when a binary value is expected.
Specifying one of the binary similarity measures imposes this behavior
unless some other option overrides it (for instance, the **allbinary** option
of **[MV] matrix dissimilarity**). See **[MV]** *measure_option* for a discussion
of binary similarity measures applied to averages.

The following binary similarity coefficients are available. Unless
stated otherwise, the similarity measures range from 0 to 1.

**matching**
requests the simple matching (Zubin 1938, Sokal and Michener 1958)
binary similarity coefficient

(a+d)/(a+b+c+d)

which is the proportion of matches between the two observations or
variables.
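
The cross-tabulation counts a, b, c, d and the simple matching
coefficient can be sketched in Python (illustrative only, not Stata
code), including the rule that nonzero values are treated as one:

```python
# Illustrative sketch of the binary cross-tabulation counts and the
# simple matching coefficient (not Stata's implementation).

def crosstab(xi, xj):
    a = b = c = d = 0
    for u, v in zip(xi, xj):
        u, v = int(u != 0), int(v != 0)   # nonzero counts as one
        if u == 1 and v == 1:   a += 1
        elif u == 1 and v == 0: b += 1
        elif u == 0 and v == 1: c += 1
        else:                   d += 1
    return a, b, c, d

def matching(xi, xj):
    """Simple matching coefficient: proportion of agreements."""
    a, b, c, d = crosstab(xi, xj)
    return (a + d) / (a + b + c + d)

xi = [1, 0, 1, 1, 0]
xj = [1, 1, 0, 1, 0]
print(crosstab(xi, xj))   # (2, 1, 1, 1)
print(matching(xi, xj))   # (2 + 1) / 5 = 0.6
```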

**Jaccard**
requests the Jaccard (1901, 1908) binary similarity coefficient

a/(a+b+c)

which is the proportion of matches when at least one of the vectors
had a one. If both vectors are all zeros, this measure is undefined.
Stata then declares the answer to be one, meaning perfect agreement.
This is a reasonable choice for most applications and will cause an
all-zero vector to have similarity of one only with another all-zero
vector. In all other cases, an all-zero vector will have Jaccard
similarity of zero to the other vector.

The Jaccard coefficient was discovered earlier by Gilbert (1884).
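
The Jaccard coefficient, with the all-zero convention described above,
can be sketched in Python (illustrative only, not Stata code):

```python
# Illustrative sketch of the Jaccard coefficient, including the
# convention that two all-zero vectors have similarity one.

def jaccard(xi, xj):
    a = sum(1 for u, v in zip(xi, xj) if u != 0 and v != 0)
    b = sum(1 for u, v in zip(xi, xj) if u != 0 and v == 0)
    c = sum(1 for u, v in zip(xi, xj) if u == 0 and v != 0)
    if a + b + c == 0:   # both vectors all zeros: perfect agreement
        return 1.0
    return a / (a + b + c)

print(jaccard([1, 0, 1], [1, 1, 0]))  # a=1, b=1, c=1: 1/3
print(jaccard([0, 0, 0], [0, 0, 0]))  # 1.0 by convention
print(jaccard([0, 0, 0], [0, 1, 0]))  # 0.0
```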

**Russell**
requests the Russell and Rao (1940) binary similarity coefficient

a/(a+b+c+d)

**Hamann**
requests the Hamann (1961) binary similarity coefficient

((a+d)-(b+c))/(a+b+c+d)

which is the number of agreements minus disagreements divided by the
total. The Hamann coefficient ranges from -1, perfect disagreement,
to 1, perfect agreement. The Hamann coefficient is equal to twice
the simple matching coefficient minus 1.

**Dice**
requests the Dice binary similarity coefficient

2a/(2a+b+c)

suggested by Czekanowski (1932), Dice (1945), and Sørensen (1948).
The Dice coefficient is similar to the Jaccard similarity coefficient
but gives twice the weight to agreements. Like the Jaccard
coefficient, the Dice coefficient is declared by Stata to be one if
both vectors are all zero, thus avoiding the case where the formula
is undefined.

**antiDice**
requests the binary similarity coefficient

a/(a+2(b+c))

which is credited to Anderberg (1973) but was shown earlier by Sokal
and Sneath (1963, 129). We did not call this the Anderberg
coefficient because there is another coefficient better known by that
name; see the **Anderberg** option. The name **antiDice** is our creation.
This coefficient takes the opposite view from the Dice coefficient
and gives double weight to disagreements. As with the Jaccard and
Dice coefficients, the anti-Dice coefficient is declared to be one if
both vectors are all zeros.

**Sneath**
requests the Sneath and Sokal (1962) binary similarity coefficient

2(a+d)/{2(a+d)+(b+c)}

which is similar to the simple matching coefficient but gives double
weight to matches. Also compare the Sneath and Sokal coefficient
with the Dice coefficient, which differs only in whether it includes
d.

**Rogers**
requests the Rogers and Tanimoto (1960) binary similarity coefficient

(a+d)/{(a+d)+2(b+c)}

which takes the opposite approach from the Sneath and Sokal
coefficient and gives double weight to disagreements. Also compare
the Rogers and Tanimoto coefficient with the anti-Dice coefficient,
which differs only in whether it includes d.

**Ochiai**
requests the Ochiai (1957) binary similarity coefficient

a/sqrt((a+b)(a+c))

The formula for the Ochiai coefficient is undefined when one or both
of the vectors being compared are all zeros. If both are all zeros,
Stata declares the measure to be one, and if only one of the two
vectors is all zeros, the measure is declared to be zero.

The Ochiai coefficient was presented earlier by Driver and Kroeber
(1932).

**Yule**
requests the Yule (see Yule [1900] and Yule and Kendall [1950])
binary similarity coefficient

(ad-bc)/(ad+bc)

which ranges from -1 to 1. The formula for the Yule coefficient is
undefined when one or both of the vectors are either all zeros or all
ones. Stata declares the measure to be 1 when b+c = 0, meaning that
there is complete agreement. Stata declares the measure to be -1
when a+d = 0, meaning that there is complete disagreement.
Otherwise, if ad-bc = 0, Stata declares the measure to be 0. These
rules, applied before using the Yule formula, avoid the cases where
the formula would produce an undefined result.
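
The special-case rules for the Yule coefficient, applied in the order
described above, can be sketched in Python (illustrative only):

```python
# Illustrative sketch of the Yule coefficient with the special-case
# rules applied before the formula (not Stata's implementation).

def yule(a, b, c, d):
    if b + c == 0:            # complete agreement
        return 1.0
    if a + d == 0:            # complete disagreement
        return -1.0
    if a * d - b * c == 0:    # avoids 0/0 when ad = bc = 0
        return 0.0
    return (a * d - b * c) / (a * d + b * c)

print(yule(3, 0, 0, 2))   # 1.0: b + c = 0
print(yule(0, 2, 3, 0))   # -1.0: a + d = 0
print(yule(2, 1, 1, 2))   # (4 - 1) / (4 + 1) = 0.6
```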

**Anderberg**
requests the Anderberg binary similarity coefficient

(a/(a+b) + a/(a+c) + d/(c+d) + d/(b+d))/4

The Anderberg coefficient is undefined when one or both vectors are
either all zeros or all ones. This difficulty is overcome by first
applying the rule that if both vectors are all ones (or both vectors
are all zeros), the similarity measure is declared to be one.
Otherwise, if any of the marginal totals (a+b, a+c, c+d, b+d) are
zero, then the similarity measure is declared to be zero.

Though this similarity coefficient is best known as the Anderberg
coefficient, it appeared earlier in Sokal and Sneath (1963, 130).

**Kulczynski**
requests the Kulczyński (1927) binary similarity coefficient

(a/(a+b) + a/(a+c))/2

The formula for this measure is undefined when one or both of the
vectors are all zeros. If both vectors are all zeros, Stata declares
the similarity measure to be one. If only one of the vectors is all
zeros, the similarity measure is declared to be zero.

**Pearson**
requests Pearson's (1900) phi binary similarity coefficient

(ad-bc)/sqrt((a+b)(a+c)(d+b)(d+c))

which ranges from -1 to 1. The formula for this coefficient is
undefined when one or both of the vectors are either all zeros or all
ones. Stata declares the measure to be 1 when b+c = 0, meaning that
there is complete agreement. Stata declares the measure to be -1
when a+d = 0, meaning that there is complete disagreement.
Otherwise, if ad-bc = 0, Stata declares the measure to be 0. These
rules, applied before using Pearson's phi coefficient formula, avoid
the cases where the formula would produce an undefined result.

**Gower2**
requests the binary similarity coefficient

ad/sqrt((a+b)(a+c)(d+b)(d+c))

which is presented by Gower (1985) but appeared earlier in Sokal and
Sneath (1963, 130). Stata uses the name **Gower2** to avoid confusion
with the better-known Gower coefficient, which is used with a mix of
binary and continuous data.

The formula for this similarity measure is undefined when one or both
of the vectors are all zeros or all ones. This is overcome by first
applying the rule that if both vectors are all ones (or both vectors
are all zeros) then the similarity measure is declared to be one.
Otherwise, if ad = 0, the similarity measure is declared to be zero.

__Dissimilarity measure for mixed data__

Here is one measure that works with a mix of binary and continuous data.
Binary variables are those containing only zeros, ones, and missing
values; all other variables are continuous. The formulas below are for
the dissimilarity between observations; see **[MV]** *measure_option* for the
formulas for the dissimilarity between variables (not presented here).

**Gower**
requests the Gower (1971) dissimilarity coefficient for a mix of
binary and continuous variables

sum(delta_ijv*d_ijv)/sum(delta_ijv)

where delta_ijv is a binary indicator equal to 1 whenever both
observations i and j are nonmissing for variable v, and zero
otherwise. Observations with missing values are not included when
using **cluster** or **mds**, and so if an observation is included, delta_ijv
= 1 and sum(delta_ijv) is the number of variables. However, using
**matrix dissimilarity** with the **Gower** option does not exclude
observations with missing values. See **[MV] cluster**, **[MV] mds**, and
**[MV] matrix dissimilarity**.

For binary variables v,

d_ijv = 0   if x_iv = x_jv
      = 1   otherwise

This is the same as the **matching** measure.

For continuous variables v,

d_ijv = |x_iv - x_jv|/(max_k(x_kv)-min_k(x_kv))

d_ijv is set to 0 if (max_k(x_kv)-min_k(x_kv)) is zero, that is, if
the range of the variable is zero. This is the **L1** measure divided by
the range of the variable.

The Gower measure interprets binary variables as those with only 0,
1, or missing values. All other variables are treated as continuous.

In **[MV] matrix dissimilarity**, missing observations are included only
in the calculation of the **Gower** dissimilarity, but the formula for
this dissimilarity measure is undefined when all the values of
delta_ijv or delta_iuv are zero. The dissimilarity is then set to
missing.
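
The Gower dissimilarity between two observations can be sketched in
Python (illustrative only, not Stata's implementation). Using `None` for
a missing value and passing the variable ranges and a binary-variable
flag explicitly are my own conventions for this sketch:

```python
# Illustrative sketch of the Gower dissimilarity for a mix of binary
# and continuous variables, with missing values handled via delta_ijv.

def gower(xi, xj, ranges, is_binary):
    num = den = 0.0
    for v, (u, w) in enumerate(zip(xi, xj)):
        if u is None or w is None:      # delta_ijv = 0: skip this variable
            continue
        if is_binary[v]:
            dv = 0.0 if u == w else 1.0            # same as the matching measure
        elif ranges[v] == 0:
            dv = 0.0                               # zero range: d_ijv set to 0
        else:
            dv = abs(u - w) / ranges[v]            # L1 scaled by the range
        num += dv
        den += 1.0                                 # delta_ijv = 1
    return num / den if den > 0 else float("nan")  # undefined: set to missing

# two binary variables and one continuous variable with range 10
xi = [1, 0,    2.0]
xj = [1, None, 7.0]
print(gower(xi, xj, ranges=[0, 0, 10.0], is_binary=[True, True, False]))
# (0 + 0.5) / 2 = 0.25
```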

__Technical note__

Normally the commands

**. matrix dissimilarity gm = x1 x2 y1, Gower**
**. clustermat waverage gm, add**

and

**. cluster waverage x1 x2 y1, measure(Gower)**

will yield the same results, and likewise with **mdsmat** and **mds**.
However, if any of the variables contain missing observations, this
will not be the case. **cluster** and **mds** exclude all observations that
have missing values for any of the variables of interest, whereas
**matrix** **dissimilarity** with the **Gower** option does not. See **[MV]**
**cluster**, **[MV] mds**, and **[MV] matrix dissimilarity** for more
information.

Note: **matrix dissimilarity** without the **Gower** option does exclude all
observations that have missing values for any of the variables of
interest.

__References__

Anderberg, M. R. 1973. *Cluster Analysis for Applications*. New York:
Academic Press.

Czekanowski, J. 1932. "Coefficient of racial likeness" und
"durchschnittliche Differenz". *Anthropologischer Anzeiger* 9:
227-249.

Dice, L. R. 1945. Measures of the amount of ecologic association between
species. *Ecology* 26: 297-302.

Driver, H. E., and A. L. Kroeber. 1932. Quantitative expression of
cultural relationships. *University of California Publications in*
*American* *Archaeology and Ethnology* 31: 211-256.

Gilbert, G. K. 1884. Finley's tornado predictions. *American*
*Meteorological Journal* 1: 166-172.

Gordon, A. D. 1999. *Classification*. 2nd ed. Boca Raton, FL: Chapman &
Hall/CRC.

Gower, J. C. 1971. A general coefficient of similarity and some of its
properties. *Biometrics* 27: 857-871.

------. 1985. Measures of similarity, dissimilarity, and distance. In
Vol. 5 of *Encyclopedia of Statistical Sciences*, ed. S. Kotz, N. L.
Johnson, and C. B. Read, 397-405. New York: Wiley.

Hamann, U. 1961. Merkmalsbestand und Verwandtschaftsbeziehungen der
Farinosae. Ein Beitrag zum System der Monokotyledonen. *Willdenowia*
2: 639-768.

Jaccard, P. 1901. Distribution de la flore alpine dans le Bassin des
Dranses et dans quelques régions voisines. *Bulletin de la Société*
*Vaudoise des Sciences Naturelles* 37: 241-272.

------. 1908. Nouvelles recherches sur la distribution florale. *Bulletin*
*de la Société Vaudoise des Sciences Naturelles* 44: 223-270.

Kulczyński, S. 1927. Die Pflanzenassoziationen der Pieninen [In Polish,
German summary]. *Bulletin International de l'Academie Polonaise des*
*Sciences et des Lettres, Classe des Sciences Mathematiques et*
*Naturelles, B (Sciences Naturelles)* Suppl. II: 57-203.

Ochiai, A. 1957. Zoogeographic studies on the soleoid fishes found in
Japan and its neighbouring regions [in Japanese, English summary].
*Bulletin of the Japanese Society of Scientific Fisheries* 22: 526-530.

Pearson, K. 1900. Mathematical contributions to the theory of evolution
-- VII. On the correlation of characters not quantitatively
measurable. *Philosophical Transactions of the Royal Society of*
*London, Series A* 195: 1-47.

Rogers, D. J., and T. T. Tanimoto. 1960. A computer program for
classifying plants. *Science* 132: 1115-1118.

Russell, P. F., and T. R. Rao. 1940. On habitat and association of
species of anopheline larvae in south-eastern Madras. *Journal of the*
*Malaria Institute of India* 3: 153-178.

Sneath, P. H. A., and R. R. Sokal. 1962. Numerical taxonomy. *Nature* 193:
855-860.

Sokal, R. R., and C. D. Michener. 1958. A statistical method for
evaluating systematic relationships. *University of Kansas Science*
*Bulletin* 38: 1409-1438.

Sokal, R. R., and P. H. A. Sneath. 1963. *Principles of Numerical*
*Taxonomy*. San Francisco: Freeman.

Sørensen, T. 1948. A method of establishing groups of equal amplitude in
plant sociology based on similarity of species content and its
application to analyses of the vegetation on Danish commons. *Royal*
*Danish Academy of Sciences and Letters, Biological Series* 5: 1-34.

Yule, G. U. 1900. On the association of attributes in statistics: With
illustrations from the material of the Childhood Society, etc.
*Philosophical Transactions of the Royal Society, Series A* 194:
257-319.

Yule, G. U., and M. G. Kendall. 1950. *An Introduction to the Theory of*
*Statistics*. 14th ed. New York: Hafner.

Zubin, J. 1938. A technique for measuring like-mindedness. *Journal of*
*Abnormal and Social Psychology* 33: 508-516.