help measure_option
-------------------------------------------------------------------------------
Title
[MV] measure_option -- Option for similarity and dissimilarity measures
Syntax
command ..., ... measure(measure) ...
or
command ..., ... measure ...
measure           description
-------------------------------------------------------------------------
cont_measure      similarity or dissimilarity measure for continuous data
binary_measure    similarity measure for binary data
mixed_measure     dissimilarity measure for a mix of binary and
                    continuous data
-------------------------------------------------------------------------
cont_measure      description
-------------------------------------------------------------------------
L2                Euclidean distance (Minkowski with argument 2)
Euclidean         alias for L2
L(2)              alias for L2
L2squared         squared Euclidean distance
Lpower(2)         alias for L2squared
L1                absolute-value distance (Minkowski with argument 1)
absolute          alias for L1
cityblock         alias for L1
manhattan         alias for L1
L(1)              alias for L1
Lpower(1)         alias for L1
Linfinity         maximum-value distance (Minkowski with infinite
                    argument)
maximum           alias for Linfinity
L(#)              Minkowski distance with # argument
Lpower(#)         Minkowski distance with # argument raised to # power
Canberra          Canberra distance
correlation       correlation coefficient similarity measure
angular           angular separation similarity measure
angle             alias for angular
-------------------------------------------------------------------------
binary_measure    description
-------------------------------------------------------------------------
matching          simple matching similarity coefficient
Jaccard           Jaccard binary similarity coefficient
Russell           Russell and Rao similarity coefficient
Hamann            Hamann similarity coefficient
Dice              Dice similarity coefficient
antiDice          anti-Dice similarity coefficient
Sneath            Sneath and Sokal similarity coefficient
Rogers            Rogers and Tanimoto similarity coefficient
Ochiai            Ochiai similarity coefficient
Yule              Yule similarity coefficient
Anderberg         Anderberg similarity coefficient
Kulczynski        Kulczynski similarity coefficient
Pearson           Pearson's phi similarity coefficient
Gower2            similarity coefficient with same denominator as Pearson
-------------------------------------------------------------------------
mixed_measure     description
-------------------------------------------------------------------------
Gower             Gower's dissimilarity coefficient
-------------------------------------------------------------------------
Description
Several commands have options that allow you to specify a similarity or
dissimilarity measure designated as measure in the syntax; see [MV]
cluster, [MV] mds, [MV] discrim knn, and [MV] matrix dissimilarity.
These options are documented here. Most analysis commands (e.g., cluster
and mds) transform similarity measures to dissimilarity measures as
needed.
Options
Measures are divided into those for continuous data, binary data, and a
mix of continuous and binary data. measure is not case sensitive.
Full definitions are presented in Similarity and dissimilarity measures
for continuous data, Similarity measures for binary data, and
Dissimilarity measures for mixed data.
The similarity or dissimilarity measure is most often used to determine
the similarity or dissimilarity between observations. However, sometimes
the similarity or dissimilarity between variables is of interest.
Similarity and dissimilarity measures for continuous data
Here are the similarity and dissimilarity measures for continuous data
available in Stata. In the following formulas, p represents the number
of variables, N is the number of observations, and x_iv denotes the value
of observation i for variable v. See [MV] measure_option for the
formulas for the similarity and dissimilarity measures between variables
(not presented here).
L2 (aliases Euclidean and L(2))
requests the Minkowski distance metric with argument 2
sqrt(sum((x_ia - x_ja)^2))
L2 is best known as Euclidean distance and is the default
dissimilarity measure for discrim knn, mds, matrix dissimilarity, and
all the cluster subcommands except for centroidlinkage,
medianlinkage, and wardslinkage, which default to using L2squared;
see [MV] discrim knn, [MV] mds, [MV] matrix dissimilarity, and [MV]
cluster.
L2squared (alias Lpower(2))
requests the square of the Minkowski distance metric with argument 2
sum((x_ia - x_ja)^2)
L2squared is best known as squared Euclidean distance and is the
default dissimilarity measure for the centroidlinkage, medianlinkage,
and wardslinkage subcommands of cluster; see [MV] cluster.
L1 (aliases absolute, cityblock, manhattan, and L(1))
requests the Minkowski distance metric with argument 1
sum(|x_ia - x_ja|)
which is best known as absolute-value distance.
Linfinity (alias maximum)
requests the Minkowski distance metric with infinite argument
max(|x_ia - x_ja|)
and is best known as maximum-value distance.
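As a worked illustration of the four measures above, the following sketch computes them for two observations. It is written in plain Python rather than Stata, and the function names and example vectors are assumptions made for illustration only; they are not part of Stata.

```python
import math

def l2(x, y):
    """Euclidean distance: sqrt(sum((x_ia - x_ja)^2))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l2squared(x, y):
    """Squared Euclidean distance: sum((x_ia - x_ja)^2)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def l1(x, y):
    """Absolute-value distance: sum(|x_ia - x_ja|)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def linfinity(x, y):
    """Maximum-value distance: max(|x_ia - x_ja|)."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical observations i and j measured on p = 3 variables.
x_i = [1.0, 2.0, 3.0]
x_j = [4.0, 6.0, 3.0]

# The per-variable differences are -3, -4, and 0.
print(l2(x_i, x_j), l2squared(x_i, x_j), l1(x_i, x_j), linfinity(x_i, x_j))
# prints: 5.0 25.0 7.0 4.0
```

For any pair of observations, Linfinity <= L2 <= L1, which the example values 4, 5, and 7 reflect.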
L(#)
requests the Minkowski distance metric with argument #:
(sum(|x_ia - x_ja|^#))^(1/#)        # >= 1
We discourage using extremely large values for #. Because the
absolute value of the difference is being raised to the value of #,
depending on the nature of your data, you could experience numeric
overflow or underflow. With a large value of #, the L() option will
produce results similar to those of the Linfinity option. Use the
numerically more stable Linfinity option instead of a large value for
# in the L() option.
Lpower(#)
requests the Minkowski distance metric with argument #, raised to the
# power:
sum(|x_ia - x_ja|^#) # >= 1
As with L(#), we discourage using extremely large values for #; see
the discussion above.
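The advice about large values of # can be illustrated with a short sketch in plain Python (not Stata). The function names and data are hypothetical. The sketch shows L(#) approaching Linfinity as # grows, and shows how raising a large difference to a large power can overflow double precision.

```python
def minkowski(x, y, p):
    """L(#): (sum(|x_ia - x_ja|^#))^(1/#), for # >= 1."""
    if p < 1:
        raise ValueError("the argument must be >= 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def minkowski_power(x, y, p):
    """Lpower(#): sum(|x_ia - x_ja|^#), for # >= 1."""
    if p < 1:
        raise ValueError("the argument must be >= 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y))

def linfinity(x, y):
    """Linfinity: max(|x_ia - x_ja|)."""
    return max(abs(a - b) for a, b in zip(x, y))

x_i = [1.0, 2.0, 3.0]
x_j = [4.0, 6.0, 3.0]

# As the argument grows, L(#) moves toward the largest absolute difference.
for p in (1, 2, 10, 100):
    print(p, minkowski(x_i, x_j, p))
print("Linfinity", linfinity(x_i, x_j))

# A difference of 10,000 raised to the 100th power exceeds the double range.
try:
    minkowski([0.0], [1.0e4], 100)
except OverflowError as err:
    print("overflow:", err)
```

Python raises an error on floating-point overflow here; other environments may instead return infinity or a missing value, so the failure mode shown is specific to this sketch.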
Canberra
requests the following distance metric
sum(|x_ia - x_ja|/(|x_ia|+|x_ja|))
which ranges from 0 to p, the number of variables. The Canberra
distance is sensitive to small changes near zero.
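To make the sensitivity near zero concrete, here is a minimal plain-Python sketch (not Stata) of the Canberra formula. The function name is hypothetical, and the handling of a term where both values are zero is an assumption of this sketch; the text above does not say how that case is treated.

```python
def canberra(x, y):
    """Canberra distance: sum(|x_ia - x_ja| / (|x_ia| + |x_ja|))."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom == 0:
            # Assumption: a 0/0 term contributes nothing to the sum.
            continue
        total += abs(a - b) / denom
    return total

# The same absolute difference (0.001) matters far more near zero.
print(canberra([0.001], [0.002]))      # about 0.333
print(canberra([100.001], [100.002]))  # about 0.000005

# Values of opposite sign contribute the maximum of 1 per variable,
# so the distance reaches p (here p = 2).
print(canberra([1.0, 2.0], [-1.0, -2.0]))
```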
correlation
requests the correlation coefficient similarity measure,
         sum((x_ia-xbar_i.)(x_ja-xbar_j.))
---------------------------------------------------
sqrt(sum((x_ia-xbar_i.)^2) * sum((x_jb-xbar_j.)^2))
where xbar_i. = sum(x_ia)/p.
The correlation similarity measure takes values between -1 and 1.
With this measure, the relative direction of the two vectors is
important. The correlation similarity measure is related to the
angular separation similarity measure (described next). The
correlation similarity measure gives the cosine of the angle between
the two vectors measured from the mean.
angular (alias angle)
requests the angular separation similarity measure
sum(x_ia * x_ja)/sqrt(sum(x_ia^2) * sum(x_jb^2))
which is the cosine of the angle between the two vectors measured
from zero and takes values from -1 to 1.
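The relationship between the two measures can be sketched in plain Python (not Stata): the correlation measure is the angular separation of the two vectors after each has been centered at its own mean. The function names and data below are hypothetical, and the sketch does not guard against the undefined case of a zero-length (or, for correlation, constant) vector.

```python
import math

def angular(x, y):
    """Angular separation: sum(x_ia*x_ja) / sqrt(sum(x_ia^2) * sum(x_jb^2))."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def correlation(x, y):
    """Correlation similarity: angular separation measured from the means."""
    p = len(x)
    xbar = sum(x) / p  # xbar_i. = sum(x_ia)/p
    ybar = sum(y) / p
    return angular([a - xbar for a in x], [b - ybar for b in y])

x_i = [1.0, 2.0, 3.0]
x_j = [3.0, 4.0, 5.0]  # x_i shifted up by 2

# The shift changes the angle measured from zero but not the correlation.
print(angular(x_i, x_j))
print(correlation(x_i, x_j))
```

Because x_j is x_i plus a constant, the correlation measure is 1 (up to rounding) while the angular measure is slightly below 1.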
Similarity measures for binary data
Similarity measures for binary data are based on the four values from the
cross-tabulation of observation i and j (when comparing observations) or
variables u and v (when comparing variables).
For comparing observations i and j, the cross-tabulation is
           |  obs. j
           |  1    0
    -------+---------
    obs. 1 |  a    b
     i   0 |  c    d
a is the number of variables where observations i and j both had ones,
and d is the number of variables where observations i and j both had
zeros. The number of variables where observation i is one and
observation j is zero is b, and the number of variables where observation
i is zero and observation j is one is c.
See [MV] measure_option to see a similar table for comparison between
variables.
Stata treats nonzero values as one when a binary value is expected.
Specifying one of the binary similarity measures imposes this behavior
unless some other option overrides it (for instance, the allbinary option
of [MV] matrix dissimilarity). See [MV] measure_option for a discussion
of binary similarity measures applied to averages.
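The cross-tabulation can be sketched in a few lines of plain Python (not Stata). The function name and data are hypothetical; the sketch follows the rule stated above that nonzero values are treated as one.

```python
def crosstab(x, y):
    """Return the counts (a, b, c, d) for two equal-length vectors.

    a: both one, b: x one and y zero, c: x zero and y one, d: both zero.
    Any nonzero value is treated as one.
    """
    a = b = c = d = 0
    for u, v in zip(x, y):
        u_one, v_one = (u != 0), (v != 0)
        if u_one and v_one:
            a += 1
        elif u_one:
            b += 1
        elif v_one:
            c += 1
        else:
            d += 1
    return a, b, c, d

# Hypothetical observations i and j on five variables; the 5 counts as a one.
obs_i = [1, 1, 0, 0, 5]
obs_j = [1, 0, 1, 0, 1]
print(crosstab(obs_i, obs_j))  # prints: (2, 1, 1, 1)
```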
The following binary similarity coefficients are available. Unless
stated otherwise, the similarity measures range from 0 to 1.
matching
requests the simple matching binary similarity coefficient
(a+d)/(a+b+c+d)
which is the proportion of matches between the two observations or
variables.
Jaccard
requests the Jaccard binary similarity coefficient
a/(a+b+c)
which is the proportion of matches when at least one of the vectors
had a one. If both vectors are all zeros, this measure is undefined.
Stata then declares the answer to be one, meaning perfect agreement.
This is a reasonable choice for most applications and will cause an
all-zero vector to have similarity of one only with another all-zero
vector. In all other cases, an all-zero vector will have Jaccard
similarity of zero to the other vector.
Russell
requests the Russell and Rao binary similarity coefficient
a/(a+b+c+d)
Hamann
requests the Hamann binary similarity coefficient
((a+d)-(b+c))/(a+b+c+d)
which is the number of agreements minus disagreements divided by the
total. The Hamann coefficient ranges from -1, perfect disagreement,
to 1, perfect agreement. The Hamann coefficient is equal to twice
the simple matching coefficient minus 1.
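The identity between the Hamann and simple matching coefficients can be checked numerically. The following is a plain-Python sketch (not Stata) with hypothetical function names; it takes the cross-tabulation counts a, b, c, and d as inputs.

```python
def matching(a, b, c, d):
    """Simple matching: (a+d)/(a+b+c+d)."""
    return (a + d) / (a + b + c + d)

def russell(a, b, c, d):
    """Russell and Rao: a/(a+b+c+d)."""
    return a / (a + b + c + d)

def hamann(a, b, c, d):
    """Hamann: ((a+d)-(b+c))/(a+b+c+d)."""
    return ((a + d) - (b + c)) / (a + b + c + d)

# Hypothetical counts: 2 joint ones, 1 + 1 mismatches, 1 joint zero.
a, b, c, d = 2, 1, 1, 1
print(matching(a, b, c, d), russell(a, b, c, d), hamann(a, b, c, d))

# Hamann equals twice the simple matching coefficient minus 1 (up to rounding).
print(abs(hamann(a, b, c, d) - (2 * matching(a, b, c, d) - 1)) < 1e-12)
```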
Dice
requests the Dice binary similarity coefficient
2a/(2a+b+c)
The Dice coefficient is similar to the Jaccard similarity coefficient
but gives twice the weight to agreements. Like the Jaccard
coefficient, the Dice coefficient is declared by Stata to be one if
both vectors are all zero, thus avoiding the case where the formula
is undefined.
antiDice
requests the binary similarity coefficient
a/(a+2(b+c))
The name antiDice is our creation. This coefficient takes the
opposite view from the Dice coefficient and gives double weight to
disagreements. As with the Jaccard and Dice coefficients, the
anti-Dice coefficient is declared to be one if both vectors are all
zeros.
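The Jaccard, Dice, and anti-Dice coefficients differ only in the relative weight given to agreements and disagreements, and all three share the all-zero rule. Here is a minimal plain-Python sketch (not Stata) with hypothetical function names; both vectors are all zeros exactly when a+b+c = 0.

```python
def jaccard(a, b, c, d):
    """Jaccard: a/(a+b+c), declared 1 when both vectors are all zeros."""
    if a + b + c == 0:
        return 1.0
    return a / (a + b + c)

def dice(a, b, c, d):
    """Dice: 2a/(2a+b+c), declared 1 when both vectors are all zeros."""
    if a + b + c == 0:
        return 1.0
    return 2 * a / (2 * a + b + c)

def antidice(a, b, c, d):
    """anti-Dice: a/(a+2(b+c)), declared 1 when both vectors are all zeros."""
    if a + b + c == 0:
        return 1.0
    return a / (a + 2 * (b + c))

# Hypothetical counts from a cross-tabulation.
a, b, c, d = 2, 1, 1, 1
print(antidice(a, b, c, d), jaccard(a, b, c, d), dice(a, b, c, d))

# Two all-zero vectors agree perfectly; an all-zero vector compared with a
# vector that contains a one has similarity zero.
print(jaccard(0, 0, 0, 5), jaccard(0, 3, 0, 2))
```

With these counts the three values are ordered anti-Dice <= Jaccard <= Dice, as expected from the weights.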
Sneath
requests the Sneath and Sokal binary similarity coefficient
2(a+d)/(2(a+d)+(b+c))
which is similar to the simple matching coefficient but gives double
weight to matches. Also compare the Sneath and Sokal coefficient
with the Dice coefficient, which differs only in whether it includes
d.
Rogers
requests the Rogers and Tanimoto binary similarity coefficient
(a+d)/((a+d)+2(b+c))
which takes the opposite approach from the Sneath and Sokal
coefficient and gives double weight to disagreements. Also compare
the Rogers and Tanimoto coefficient with the anti-Dice coefficient,
which differs only in whether it includes d.
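The Sneath and Sokal and the Rogers and Tanimoto coefficients bracket the simple matching coefficient, which the following plain-Python sketch (not Stata) illustrates. The function names and counts are hypothetical.

```python
def matching(a, b, c, d):
    """Simple matching: (a+d)/(a+b+c+d)."""
    return (a + d) / (a + b + c + d)

def sneath(a, b, c, d):
    """Sneath and Sokal: 2(a+d)/(2(a+d)+(b+c)); double weight to matches."""
    return 2 * (a + d) / (2 * (a + d) + (b + c))

def rogers(a, b, c, d):
    """Rogers and Tanimoto: (a+d)/((a+d)+2(b+c)); double weight to mismatches."""
    return (a + d) / ((a + d) + 2 * (b + c))

a, b, c, d = 2, 1, 1, 1
print(rogers(a, b, c, d), matching(a, b, c, d), sneath(a, b, c, d))
```

Setting d to zero in these functions reproduces the anti-Dice, Jaccard, and Dice formulas, respectively, which is the comparison the text draws.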
Ochiai
requests the Ochiai binary similarity coefficient
a/sqrt((a+b)(a+c))
The formula for the Ochiai coefficient is undefined when one or both
of the vectors being compared are all zeros. If both are all zeros,
Stata declares the measure to be one, and if only one of the two
vectors is all zeros, the measure is declared to be zero.
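The special-case rules for the Ochiai coefficient can be written as guards ahead of the formula. This is a plain-Python sketch (not Stata) with a hypothetical function name; observation i is all zeros when a+b = 0, and observation j is all zeros when a+c = 0.

```python
import math

def ochiai(a, b, c, d):
    """Ochiai: a/sqrt((a+b)(a+c)), with the all-zero rules applied first."""
    i_all_zero = (a + b == 0)
    j_all_zero = (a + c == 0)
    if i_all_zero and j_all_zero:
        return 1.0  # both vectors are all zeros
    if i_all_zero or j_all_zero:
        return 0.0  # only one vector is all zeros
    return a / math.sqrt((a + b) * (a + c))

print(ochiai(2, 1, 1, 1))  # regular case
print(ochiai(0, 0, 0, 5))  # both all zeros
print(ochiai(0, 0, 3, 2))  # observation i all zeros, j not
```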
Yule
requests the Yule binary similarity coefficient
(ad-bc)/(ad+bc)
which ranges from -1 to 1. The formula for the Yule coefficient is
undefined when one or both of the vectors are either all zeros or all
ones. Stata declares the measure to be 1 when b+c = 0, meaning that
there is complete agreement. Stata declares the measure to be -1
when a+d = 0, meaning that there is complete disagreement.
Otherwise, if ad-bc = 0, Stata declares the measure to be 0. These
rules, applied before using the Yule formula, avoid the cases where
the formula would produce an undefined result.
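The order in which the Yule rules are applied matters, so a short sketch may help. The following plain-Python function (not Stata; the name is hypothetical) applies the three rules in the order given above and only then evaluates the formula.

```python
def yule(a, b, c, d):
    """Yule: (ad-bc)/(ad+bc), with the special-case rules applied first."""
    if b + c == 0:
        return 1.0   # complete agreement
    if a + d == 0:
        return -1.0  # complete disagreement
    if a * d - b * c == 0:
        return 0.0   # also covers ad = bc = 0, where the formula is 0/0
    return (a * d - b * c) / (a * d + b * c)

print(yule(2, 1, 1, 1))  # regular case
print(yule(3, 0, 0, 2))  # complete agreement
print(yule(0, 2, 3, 0))  # complete disagreement
print(yule(0, 2, 0, 3))  # observation j all zeros: ad-bc = 0
```

The denominator ad+bc can be zero only when ad and bc are both zero, in which case ad-bc is also zero and the third rule has already returned.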
Anderberg
requests the Anderberg binary similarity coefficient
(a/(a+b) + a/(a+c) + d/(c+d) + d/(b+d))/4
The Anderberg coefficient is undefined when one or both vectors are
either all zeros or all ones. This difficulty is overcome by first
applying the rule that if both vectors are all ones (or both vectors
are all zeros), the similarity measure is declared to be one.
Otherwise, if any of the marginal totals (a+b, a+c, c+d, b+d) are
zero, then the similarity measure is declared to be zero.
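The two-step rule for the Anderberg coefficient can be sketched as follows in plain Python (not Stata). The function name is hypothetical. Both vectors are all ones when b = c = d = 0 and both are all zeros when a = b = c = 0.

```python
def anderberg(a, b, c, d):
    """Anderberg: (a/(a+b) + a/(a+c) + d/(c+d) + d/(b+d))/4, rules first."""
    if b == 0 and c == 0 and (a == 0 or d == 0):
        return 1.0  # both all ones, or both all zeros
    if 0 in (a + b, a + c, c + d, b + d):
        return 0.0  # some marginal total is zero
    return (a / (a + b) + a / (a + c) + d / (c + d) + d / (b + d)) / 4

print(anderberg(2, 1, 1, 1))  # regular case
print(anderberg(4, 0, 0, 0))  # both all ones
print(anderberg(0, 0, 3, 2))  # observation i all zeros, j mixed
```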
Kulczynski
requests the Kulczynski binary similarity coefficient
(a/(a+b) + a/(a+c))/2
The formula for this measure is undefined when one or both of the
vectors are all zeros. If both vectors are all zeros, Stata declares
the similarity measure to be one. If only one of the vectors is all
zeros, the similarity measure is declared to be zero.
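A minimal plain-Python sketch (not Stata; the function name is hypothetical) of the Kulczynski coefficient with its all-zero rules. The formula is the arithmetic mean of a/(a+b) and a/(a+c), the proportions of ones in each vector that are matched by a one in the other.

```python
def kulczynski(a, b, c, d):
    """Kulczynski: (a/(a+b) + a/(a+c))/2, with the all-zero rules first."""
    i_all_zero = (a + b == 0)
    j_all_zero = (a + c == 0)
    if i_all_zero and j_all_zero:
        return 1.0  # both vectors are all zeros
    if i_all_zero or j_all_zero:
        return 0.0  # only one vector is all zeros
    return (a / (a + b) + a / (a + c)) / 2

print(kulczynski(2, 1, 1, 1))  # regular case
print(kulczynski(1, 3, 0, 1))  # unequal numbers of ones: (1/4 + 1/1)/2
print(kulczynski(0, 0, 0, 5))  # both all zeros
```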
Pearson
requests Pearson's phi binary similarity coefficient
(ad-bc)/sqrt((a+b)(a+c)(d+b)(d+c))
which ranges from -1 to 1. The formula for this coefficient is
undefined when one or both of the vectors are either all zeros or all
ones. Stata declares the measure to be 1 when b+c = 0, meaning that
there is complete agreement. Stata declares the measure to be -1
when a+d = 0, meaning that there is complete disagreement.
Otherwise, if ad-bc = 0, Stata declares the measure to be 0. These
rules, applied before using Pearson's phi coefficient formula, avoid
the cases where the formula would produce an undefined result.
Gower2
requests the binary similarity coefficient
ad/sqrt((a+b)(a+c)(d+b)(d+c))
Stata uses the name Gower2 to avoid confusion with the better-known
Gower coefficient.
The formula for this similarity measure is undefined when one or both
of the vectors are all zeros or all ones. This is overcome by first
applying the rule that if both vectors are all ones (or both vectors
are all zeros) then the similarity measure is declared to be one.
Otherwise, if ad = 0, the similarity measure is declared to be zero.
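Pearson's phi and Gower2 share a denominator and differ only in the numerator and in their special-case rules. The following plain-Python sketch (not Stata; function names are hypothetical) places the two side by side.

```python
import math

def _denominator(a, b, c, d):
    """Shared denominator: sqrt((a+b)(a+c)(d+b)(d+c))."""
    return math.sqrt((a + b) * (a + c) * (d + b) * (d + c))

def pearson_phi(a, b, c, d):
    """Pearson's phi: (ad-bc)/denominator, special-case rules first."""
    if b + c == 0:
        return 1.0   # complete agreement
    if a + d == 0:
        return -1.0  # complete disagreement
    if a * d - b * c == 0:
        return 0.0   # also covers any zero marginal total
    return (a * d - b * c) / _denominator(a, b, c, d)

def gower2(a, b, c, d):
    """Gower2: ad/denominator, special-case rules first."""
    if b == 0 and c == 0 and (a == 0 or d == 0):
        return 1.0   # both all ones, or both all zeros
    if a * d == 0:
        return 0.0
    return a * d / _denominator(a, b, c, d)

a, b, c, d = 2, 1, 1, 1
print(pearson_phi(a, b, c, d), gower2(a, b, c, d))
```

In both functions the guards return before the formula whenever the denominator would be zero: a zero marginal total forces ad-bc = 0 for Pearson's phi and ad = 0 for Gower2.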
Dissimilarity measures for mixed data
Here is one measure that works with a mix of binary and continuous data.
Binary variables are those containing only zeros, ones, and missing
values; all other variables are continuous. The formulas below are for
the dissimilarity between observations; see [MV] measure_option for the
formulas for the dissimilarity between variables (not presented here).
Gower
requests the Gower dissimilarity coefficient for a mix of binary and
continuous variables
sum(delta_ijv*d_ijv)/sum(delta_ijv)
where delta_ijv is a binary indicator equal to one whenever both
observations i and j are nonmissing for variable v, and zero
otherwise. Observations with missing values are not included when
using cluster or mds, and so if an observation is included, delta_ijv
= 1 and sum(delta_ijv) is the number of variables. However, using
matrix dissimilarity with the Gower option does not exclude
observations with missing values. See [MV] cluster, [MV] mds, and
[MV] matrix dissimilarity.
For binary variables v,
d_ijv = 0 if x_iv=x_jv
= 1 otherwise
This is the same as the matching measure.
For continuous variables v,
d_ijv = |x_iv - x_jv|/(max_k(x_kv)-min_k(x_kv))
d_ijv is set to 0 if (max_k(x_kv)-min_k(x_kv)) is zero, i.e., if the
range of the variable is zero. This is the L1 measure divided by the
range of the variable.
The Gower measure interprets binary variables as those with only 0,
1, or missing values. All other variables are treated as continuous.
In [MV] matrix dissimilarity, missing observations are included only
in the calculation of the Gower dissimilarity, but the formula for
this dissimilarity measure is undefined when all of the values of
delta_ijv or delta_iuv are zero. The dissimilarity is then set to
missing.
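The pieces of the Gower definition can be combined in a short plain-Python sketch (not Stata). The function name, the use of None for missing values, and the data are all hypothetical. The sketch assumes that the range of a continuous variable is taken over its nonmissing values, which the text above does not state explicitly.

```python
def gower(i, j, data):
    """Gower dissimilarity between rows i and j of data (None = missing).

    Returns None when no variable is nonmissing for both rows,
    mirroring the rule that the dissimilarity is then set to missing.
    """
    num = 0.0
    den = 0
    for v in range(len(data[0])):
        xi, xj = data[i][v], data[j][v]
        if xi is None or xj is None:
            continue  # delta_ijv = 0
        col = [row[v] for row in data if row[v] is not None]
        if all(val in (0, 1) for val in col):
            # Binary variable: 0 if the values match, 1 otherwise.
            d = 0.0 if xi == xj else 1.0
        else:
            # Continuous variable: absolute difference divided by the range
            # (assumed here to be over nonmissing values), or 0 if range is 0.
            rng = max(col) - min(col)
            d = abs(xi - xj) / rng if rng != 0 else 0.0
        num += d
        den += 1  # delta_ijv = 1
    return num / den if den > 0 else None

# Hypothetical data: one binary and one continuous variable, four observations.
data = [
    [1, 10.0],
    [0, 20.0],
    [1, 30.0],
    [0, None],
]
print(gower(0, 1, data))  # (1 + 10/20) / 2
print(gower(0, 2, data))  # (0 + 20/20) / 2
print(gower(0, 3, data))  # only the binary variable is usable
```

The last call corresponds to the matrix dissimilarity behavior described above, where an observation with a missing value is kept and the sum runs over the variables that are nonmissing for both observations.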
Technical note
Normally the commands
. matrix dissimilarity gm = x1 x2 y1, Gower
. clustermat waverage gm, add
and
. cluster waverage x1 x2 y1, measure(Gower)
will yield the same results, and likewise with mdsmat and mds.
However, if any of the variables contain missing observations, this
will not be the case. cluster and mds exclude all observations that
have missing values for any of the variables of interest, whereas
matrix dissimilarity with the Gower option does not. See [MV]
cluster, [MV] mds, and [MV] matrix dissimilarity for more
information.
Note: matrix dissimilarity without the Gower option does exclude all
observations that have missing values for any of the variables of
interest.
Also see
Manual: [MV] measure_option
Help: [MV] cluster, [MV] matrix dissimilarity; [MV] parse_dissim