Stata 11 help for measure_option


Title

[MV] measure_option -- Option for similarity and dissimilarity measures

Syntax

command ..., ... measure(measure) ...

or

command ..., ... measure ...

measure            description
-------------------------------------------------------------------------
cont_measure       similarity or dissimilarity measure for continuous data
binary_measure     similarity measure for binary data
mixed_measure      dissimilarity measure for a mix of binary and continuous data
-------------------------------------------------------------------------

cont_measure       description
-------------------------------------------------------------------------
L2                 Euclidean distance (Minkowski with argument 2)
Euclidean          alias for L2
L(2)               alias for L2
L2squared          squared Euclidean distance
Lpower(2)          alias for L2squared
L1                 absolute-value distance (Minkowski with argument 1)
absolute           alias for L1
cityblock          alias for L1
manhattan          alias for L1
L(1)               alias for L1
Lpower(1)          alias for L1
Linfinity          maximum-value distance (Minkowski with infinite argument)
maximum            alias for Linfinity
L(#)               Minkowski distance with # argument
Lpower(#)          Minkowski distance with # argument raised to # power
Canberra           Canberra distance
correlation        correlation coefficient similarity measure
angular            angular separation similarity measure
angle              alias for angular
-------------------------------------------------------------------------

binary_measure     description
-------------------------------------------------------------------------
matching           simple matching similarity coefficient
Jaccard            Jaccard binary similarity coefficient
Russell            Russell and Rao similarity coefficient
Hamann             Hamann similarity coefficient
Dice               Dice similarity coefficient
antiDice           anti-Dice similarity coefficient
Sneath             Sneath and Sokal similarity coefficient
Rogers             Rogers and Tanimoto similarity coefficient
Ochiai             Ochiai similarity coefficient
Yule               Yule similarity coefficient
Anderberg          Anderberg similarity coefficient
Kulczynski         Kulczynski similarity coefficient
Pearson            Pearson's phi similarity coefficient
Gower2             similarity coefficient with same denominator as Pearson
-------------------------------------------------------------------------

mixed_measure      description
-------------------------------------------------------------------------
Gower              Gower's dissimilarity coefficient
-------------------------------------------------------------------------

Description

Several commands have options that allow you to specify a similarity or dissimilarity measure designated as measure in the syntax; see [MV] cluster, [MV] mds, [MV] discrim knn, and [MV] matrix dissimilarity. These options are documented here. Most analysis commands (e.g., cluster and mds) transform similarity measures to dissimilarity measures as needed.

Options

Measures are divided into those for continuous data, binary data, and a mix of continuous and binary data. measure is not case sensitive. Full definitions are presented in Similarity and dissimilarity measures for continuous data, Similarity measures for binary data, and Dissimilarity measure for mixed data.

The similarity or dissimilarity measure is most often used to determine the similarity or dissimilarity between observations. However, sometimes the similarity or dissimilarity between variables is of interest.

Similarity and dissimilarity measures for continuous data

Here are the similarity and dissimilarity measures for continuous data available in Stata. In the following formulas, p represents the number of variables, N is the number of observations, and x_iv denotes the value of observation i for variable v. Each sum or max in the formulas compares observations i and j and runs over the variables, indexed by a (or b) from 1 to p. See [MV] measure_option for the formulas for the similarity and dissimilarity measures between variables (not presented here).

L2 (aliases Euclidean and L(2)) requests the Minkowski distance metric with argument 2

sqrt(sum((x_ia - x_ja)^2))

L2 is best known as Euclidean distance and is the default dissimilarity measure for discrim knn, mds, matrix dissimilarity, and all the cluster subcommands except for centroidlinkage, medianlinkage, and wardslinkage, which default to using L2squared; see [MV] discrim knn, [MV] mds, [MV] matrix dissimilarity, and [MV] cluster.

L2squared (alias Lpower(2)) requests the square of the Minkowski distance metric with argument 2

sum((x_ia - x_ja)^2)

L2squared is best known as squared Euclidean distance and is the default dissimilarity measure for the centroidlinkage, medianlinkage, and wardslinkage subcommands of cluster; see [MV] cluster.

L1 (aliases absolute, cityblock, manhattan, and L(1)) requests the Minkowski distance metric with argument 1

sum(|x_ia - x_ja|)

which is best known as absolute-value distance.

Linfinity (alias maximum) requests the Minkowski distance metric with infinite argument

max(|x_ia - x_ja|)

and is best known as maximum-value distance.
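The four fixed-argument Minkowski measures above can be checked by hand on a small example. The following is a minimal Python sketch of the formulas, not Stata's implementation; the function names and the two observations are made up for illustration.

```python
import math

def l2(x, y):
    """L2: Euclidean distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l2squared(x, y):
    """L2squared: squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def l1(x, y):
    """L1: absolute-value (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def linfinity(x, y):
    """Linfinity: maximum-value distance."""
    return max(abs(a - b) for a, b in zip(x, y))

# Two made-up observations on p = 3 variables; the differences are -3, -4, 0.
x_i = [1.0, 2.0, 3.0]
x_j = [4.0, 6.0, 3.0]
print(l2(x_i, x_j))         # 5.0
print(l2squared(x_i, x_j))  # 25.0
print(l1(x_i, x_j))         # 7.0
print(linfinity(x_i, x_j))  # 4.0
```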

L(#) requests the Minkowski distance metric with argument #:

(sum(|x_ia - x_ja|^#))^(1/#),   # >= 1

We discourage using extremely large values for #. Because the absolute value of the difference is being raised to the value of #, depending on the nature of your data, you could experience numeric overflow or underflow. With a large value of #, the L() option will produce results similar to those of the Linfinity option. Use the numerically more stable Linfinity option instead of a large value for # in the L() option.

Lpower(#) requests the Minkowski distance metric with argument #, raised to the # power:

sum(|x_ia - x_ja|^#),   # >= 1

As with L(#), we discourage using extremely large values for #; see the discussion above.
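To see why a large # adds little beyond Linfinity, the general Minkowski formulas can be evaluated for increasing arguments. This is a minimal Python sketch of the L(#) and Lpower(#) formulas only; the function names are hypothetical and the data are made up.

```python
def minkowski(x, y, power):
    """L(#): (sum of |x - y|^#) raised to 1/#, for # >= 1."""
    if power < 1:
        raise ValueError("the argument # must be >= 1")
    return sum(abs(a - b) ** power for a, b in zip(x, y)) ** (1.0 / power)

def minkowski_power(x, y, power):
    """Lpower(#): sum of |x - y|^#, for # >= 1."""
    if power < 1:
        raise ValueError("the argument # must be >= 1")
    return sum(abs(a - b) ** power for a, b in zip(x, y))

x_i = [1.0, 2.0, 3.0]
x_j = [4.0, 6.0, 3.0]
max_diff = max(abs(a - b) for a, b in zip(x_i, x_j))  # the Linfinity value, 4.0

# As # grows, L(#) moves from the L1 value toward the Linfinity value.
for power in (1, 2, 10, 100):
    print(power, minkowski(x_i, x_j, power), max_diff)
```

The intermediate sum grows like the largest difference raised to #, which is where overflow or underflow can arise for extreme data.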

Canberra requests the following distance metric

sum(|x_ia - x_ja|/(|x_ia|+|x_ja|))

which ranges from 0 to p, the number of variables. The Canberra distance is sensitive to small changes near zero.
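The sensitivity near zero follows directly from the formula: each term is a ratio, so a tiny value compared with zero contributes a full 1. A minimal Python sketch, assuming that a term with a zero denominator (both values zero) contributes 0; the text above does not say how Stata treats that case, and the function name is hypothetical.

```python
def canberra(x, y):
    """Canberra distance: sum of |x - y| / (|x| + |y|) over the variables."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom != 0:  # assumption: a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total

print(canberra([1.0, 0.0, 2.0], [3.0, 5.0, 2.0]))  # 0.5 + 1.0 + 0.0 = 1.5
print(canberra([0.001], [0.0]))                    # 1.0, the maximum for p = 1
```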

correlation requests the correlation coefficient similarity measure,

             sum((x_ia - xbar_i.)(x_ja - xbar_j.))
---------------------------------------------------------
sqrt(sum((x_ia - xbar_i.)^2) * sum((x_jb - xbar_j.)^2))

where xbar_i. = sum(x_ia)/p.

The correlation similarity measure takes values between -1 and 1. With this measure, the relative direction of the two vectors is important. The correlation similarity measure is related to the angular separation similarity measure (described next). The correlation similarity measure gives the cosine of the angle between the two vectors measured from the mean.

angular (alias angle) requests the angular separation similarity measure

sum(x_ia * x_ja)/sqrt(sum(x_ia^2) * sum(x_jb^2))

which is the cosine of the angle between the two vectors measured from zero and takes values from -1 to 1.
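The relationship between the two measures is that correlation is angular separation applied after each observation has been centered at its own mean. A minimal Python sketch of both formulas; the function names are hypothetical, the data are made up, and vectors with zero length (which make the denominator zero) are not handled.

```python
import math

def angular(x, y):
    """Cosine of the angle between x and y, measured from zero."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def correlation(x, y):
    """Cosine of the angle between x and y, measured from their means."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    return angular([a - xbar for a in x], [b - ybar for b in y])

x_i = [1.0, 2.0, 3.0]
x_j = [11.0, 12.0, 13.0]   # x_i shifted by 10
print(correlation(x_i, x_j))  # 1.0: the shift disappears after centering
print(angular(x_i, x_j))      # below 1: the shift changes the angle from zero
```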

Similarity measures for binary data

Similarity measures for binary data are based on the four values from the cross-tabulation of observation i and j (when comparing observations) or variables u and v (when comparing variables).

For comparing observations i and j, the cross-tabulation is

               |   obs. j
               |   1     0
        -------+------------
        obs. 1 |   a     b
          i  0 |   c     d

a is the number of variables where observations i and j both had ones, and d is the number of variables where observations i and j both had zeros. The number of variables where observation i is one and observation j is zero is b, and the number of variables where observation i is zero and observation j is one is c.

See [MV] measure_option to see a similar table for comparison between variables.

Stata treats nonzero values as one when a binary value is expected. Specifying one of the binary similarity measures imposes this behavior unless some other option overrides it (for instance, the allbinary option of [MV] matrix dissimilarity). See [MV] measure_option for a discussion of binary similarity measures applied to averages.
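The four counts can be tallied in a single pass over the variables. A minimal Python sketch, with a hypothetical function name, that treats any nonzero value as one as described above:

```python
def crosstab(obs_i, obs_j):
    """Return the counts (a, b, c, d) for two observations."""
    a = b = c = d = 0
    for u, v in zip(obs_i, obs_j):
        u_one, v_one = (u != 0), (v != 0)  # nonzero counts as one
        if u_one and v_one:
            a += 1      # both ones
        elif u_one:
            b += 1      # i is one, j is zero
        elif v_one:
            c += 1      # i is zero, j is one
        else:
            d += 1      # both zeros
    return a, b, c, d

print(crosstab([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))  # (2, 1, 1, 1)
```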

The following binary similarity coefficients are available. Unless stated otherwise, the similarity measures range from 0 to 1.

matching requests the simple matching binary similarity coefficient

(a+d)/(a+b+c+d)

which is the proportion of matches between the two observations or variables.

Jaccard requests the Jaccard binary similarity coefficient

a/(a+b+c)

which is the proportion of matches when at least one of the vectors had a one. If both vectors are all zeros, this measure is undefined. Stata then declares the answer to be one, meaning perfect agreement. This is a reasonable choice for most applications and will cause an all-zero vector to have similarity of one only with another all-zero vector. In all other cases, an all-zero vector will have Jaccard similarity of zero to the other vector.
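The all-zero rule amounts to one guard before the Jaccard formula. A minimal Python sketch written in terms of the counts a, b, c, and d; the function names are hypothetical and this is not Stata's code.

```python
def matching(a, b, c, d):
    """Simple matching: proportion of variables on which the vectors agree."""
    return (a + d) / (a + b + c + d)

def jaccard(a, b, c, d):
    """Jaccard: matches among variables where at least one vector has a one."""
    if a + b + c == 0:   # both vectors are all zeros: declared to be one
        return 1.0
    return a / (a + b + c)

print(matching(2, 1, 1, 1))  # 0.6
print(jaccard(2, 1, 1, 1))   # 0.5
print(jaccard(0, 0, 0, 5))   # 1.0, two all-zero vectors
print(jaccard(0, 3, 0, 2))   # 0.0, an all-zero vector against any other vector
```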

Russell requests the Russell and Rao binary similarity coefficient

a/(a+b+c+d)

Hamann requests the Hamann binary similarity coefficient

((a+d)-(b+c))/(a+b+c+d)

which is the number of agreements minus disagreements divided by the total. The Hamann coefficient ranges from -1, perfect disagreement, to 1, perfect agreement. The Hamann coefficient is equal to twice the simple matching coefficient minus 1.
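The stated link to the simple matching coefficient can be confirmed numerically. A minimal Python sketch with hypothetical function names and made-up counts:

```python
import math

def matching(a, b, c, d):
    """Simple matching coefficient."""
    return (a + d) / (a + b + c + d)

def hamann(a, b, c, d):
    """Hamann: agreements minus disagreements, divided by the total."""
    return ((a + d) - (b + c)) / (a + b + c + d)

counts = (2, 1, 1, 1)
# Hamann equals twice the matching coefficient minus 1 (up to rounding).
print(math.isclose(hamann(*counts), 2 * matching(*counts) - 1))  # True
```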

Dice requests the Dice binary similarity coefficient

2a/(2a+b+c)

The Dice coefficient is similar to the Jaccard similarity coefficient but gives twice the weight to agreements. Like the Jaccard coefficient, the Dice coefficient is declared by Stata to be one if both vectors are all zero, thus avoiding the case where the formula is undefined.

antiDice requests the binary similarity coefficient

a/(a+2(b+c))

The name antiDice is our creation. This coefficient takes the opposite view from the Dice coefficient and gives double weight to disagreements. As with the Jaccard and Dice coefficients, the anti-Dice coefficient is declared to be one if both vectors are all zeros.

Sneath requests the Sneath and Sokal binary similarity coefficient

2(a+d)/(2(a+d)+(b+c))

which is similar to the simple matching coefficient but gives double weight to matches. Also compare the Sneath and Sokal coefficient with the Dice coefficient, which differs only in whether it includes d.

Rogers requests the Rogers and Tanimoto binary similarity coefficient

(a+d)/((a+d)+2(b+c))

which takes the opposite approach from the Sneath and Sokal coefficient and gives double weight to disagreements. Also compare the Rogers and Tanimoto coefficient with the anti-Dice coefficient, which differs only in whether it includes d.
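The pairings described above (Dice with Sneath and Sokal, anti-Dice with Rogers and Tanimoto) are easiest to see by setting d to zero, where each pair coincides. A minimal Python sketch with hypothetical function names and made-up counts:

```python
def dice(a, b, c, d):
    if a + b + c == 0:          # both vectors all zeros: declared to be one
        return 1.0
    return 2 * a / (2 * a + b + c)

def anti_dice(a, b, c, d):
    if a + b + c == 0:          # both vectors all zeros: declared to be one
        return 1.0
    return a / (a + 2 * (b + c))

def sneath(a, b, c, d):
    return 2 * (a + d) / (2 * (a + d) + (b + c))

def rogers(a, b, c, d):
    return (a + d) / ((a + d) + 2 * (b + c))

# With d = 0 the coefficients that differ only by d coincide.
print(dice(2, 1, 1, 0) == sneath(2, 1, 1, 0))       # True
print(anti_dice(2, 1, 1, 0) == rogers(2, 1, 1, 0))  # True
# With d > 0 the joint zeros raise Sneath and Rogers above their counterparts.
print(dice(2, 1, 1, 3), sneath(2, 1, 1, 3))
print(anti_dice(2, 1, 1, 3), rogers(2, 1, 1, 3))
```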

Ochiai requests the Ochiai binary similarity coefficient

a/sqrt((a+b)(a+c))

The formula for the Ochiai coefficient is undefined when one or both of the vectors being compared are all zeros. If both are all zeros, Stata declares the measure to be one, and if only one of the two vectors is all zeros, the measure is declared to be zero.

Yule requests the Yule binary similarity coefficient

(ad-bc)/(ad+bc)

which ranges from -1 to 1. The formula for the Yule coefficient is undefined when one or both of the vectors are either all zeros or all ones. Stata declares the measure to be 1 when b+c = 0, meaning that there is complete agreement. Stata declares the measure to be -1 when a+d = 0, meaning that there is complete disagreement. Otherwise, if ad-bc = 0, Stata declares the measure to be 0. These rules, applied before using the Yule formula, avoid the cases where the formula would produce an undefined result.
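The three rules can be read as guards evaluated in order before the formula. A minimal Python sketch that follows the order given above; the function name is hypothetical and this is not Stata's code.

```python
def yule(a, b, c, d):
    """Yule coefficient with the special-case rules applied first."""
    if b + c == 0:            # complete agreement
        return 1.0
    if a + d == 0:            # complete disagreement
        return -1.0
    if a * d - b * c == 0:    # includes every remaining case with ad + bc = 0
        return 0.0
    return (a * d - b * c) / (a * d + b * c)

print(yule(3, 0, 0, 0))   # 1.0, both vectors all ones
print(yule(0, 2, 3, 0))   # -1.0
print(yule(2, 1, 0, 0))   # 0.0, one vector all ones
print(yule(2, 1, 1, 1))   # (2 - 1)/(2 + 1)
```

Once the first two guards have passed, ad + bc can be zero only if ad and bc are both zero, so the third guard catches every remaining undefined case.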

Anderberg requests the Anderberg binary similarity coefficient

(a/(a+b) + a/(a+c) + d/(c+d) + d/(b+d))/4

The Anderberg coefficient is undefined when one or both vectors are either all zeros or all ones. This difficulty is overcome by first applying the rule that if both vectors are all ones (or both vectors are all zeros), the similarity measure is declared to be one. Otherwise, if any of the marginal totals (a+b, a+c, c+d, b+d) are zero, then the similarity measure is declared to be zero.
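The two rules can be written as guards ahead of the formula. A minimal Python sketch, assuming at least one variable is present; the function name is hypothetical.

```python
def anderberg(a, b, c, d):
    """Anderberg coefficient with the special-case rules applied first."""
    # Both vectors all ones (b = c = d = 0) or both all zeros (a = b = c = 0).
    if b + c == 0 and (a == 0 or d == 0):
        return 1.0
    # Otherwise any zero marginal total makes the formula undefined.
    if 0 in (a + b, a + c, c + d, b + d):
        return 0.0
    return (a / (a + b) + a / (a + c) + d / (c + d) + d / (b + d)) / 4

print(anderberg(3, 0, 0, 0))  # 1.0, both all ones
print(anderberg(0, 0, 0, 4))  # 1.0, both all zeros
print(anderberg(2, 1, 0, 0))  # 0.0, observation i is all ones, j is not
print(anderberg(2, 1, 1, 1))  # (2/3 + 2/3 + 1/2 + 1/2)/4
```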

Kulczynski requests the Kulczynski binary similarity coefficient

(a/(a+b) + a/(a+c))/2

The formula for this measure is undefined when one or both of the vectors are all zeros. If both vectors are all zeros, Stata declares the similarity measure to be one. If only one of the vectors is all zeros, the similarity measure is declared to be zero.
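The Ochiai and Kulczynski coefficients share the same all-zero rules and are built from the same two ratios, a/(a+b) and a/(a+c): Ochiai is their geometric mean and Kulczynski their arithmetic mean, so Ochiai never exceeds Kulczynski. A minimal Python sketch with hypothetical function names:

```python
import math

def ochiai(a, b, c, d):
    if a + b == 0 and a + c == 0:   # both vectors all zeros
        return 1.0
    if a + b == 0 or a + c == 0:    # exactly one vector all zeros
        return 0.0
    return a / math.sqrt((a + b) * (a + c))

def kulczynski(a, b, c, d):
    if a + b == 0 and a + c == 0:   # both vectors all zeros
        return 1.0
    if a + b == 0 or a + c == 0:    # exactly one vector all zeros
        return 0.0
    return (a / (a + b) + a / (a + c)) / 2

print(ochiai(2, 2, 0, 1), kulczynski(2, 2, 0, 1))  # kulczynski is 0.75
print(ochiai(0, 0, 0, 3), kulczynski(0, 0, 0, 3))  # 1.0 1.0
print(ochiai(0, 0, 2, 1), kulczynski(0, 0, 2, 1))  # 0.0 0.0
```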

Pearson requests Pearson's phi binary similarity coefficient

(ad-bc)/sqrt((a+b)(a+c)(d+b)(d+c))

which ranges from -1 to 1. The formula for this coefficient is undefined when one or both of the vectors are either all zeros or all ones. Stata declares the measure to be 1 when b+c = 0, meaning that there is complete agreement. Stata declares the measure to be -1 when a+d = 0, meaning that there is complete disagreement. Otherwise, if ad-bc = 0, Stata declares the measure to be 0. These rules, applied before using Pearson's phi coefficient formula, avoid the cases where the formula would produce an undefined result.

Gower2 requests the binary similarity coefficient

ad/sqrt((a+b)(a+c)(d+b)(d+c))

Stata uses the name Gower2 to avoid confusion with the better-known Gower coefficient.

The formula for this similarity measure is undefined when one or both of the vectors are all zeros or all ones. This is overcome by first applying the rule that if both vectors are all ones (or both vectors are all zeros) then the similarity measure is declared to be one. Otherwise, if ad = 0, the similarity measure is declared to be zero.
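Pearson's phi and Gower2 share a denominator and differ only in the numerator and in their special-case rules. A minimal Python sketch of both, following the rules stated above; the function names are hypothetical and this is not Stata's code.

```python
import math

def shared_denominator(a, b, c, d):
    return math.sqrt((a + b) * (a + c) * (d + b) * (d + c))

def pearson(a, b, c, d):
    if b + c == 0:            # complete agreement
        return 1.0
    if a + d == 0:            # complete disagreement
        return -1.0
    if a * d - b * c == 0:    # covers the remaining zero-denominator cases
        return 0.0
    return (a * d - b * c) / shared_denominator(a, b, c, d)

def gower2(a, b, c, d):
    if b + c == 0 and (a == 0 or d == 0):  # both all ones or both all zeros
        return 1.0
    if a * d == 0:            # covers the remaining zero-denominator cases
        return 0.0
    return a * d / shared_denominator(a, b, c, d)

# Denominator is sqrt(3 * 3 * 2 * 2) = 6 for these made-up counts.
print(pearson(2, 1, 1, 1))  # (2 - 1)/6
print(gower2(2, 1, 1, 1))   # 2/6
```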

Dissimilarity measure for mixed data

Here is one measure that works with a mix of binary and continuous data. Binary variables are those containing only zeros, ones, and missing values; all other variables are continuous. The formulas below are for the dissimilarity between observations; see [MV] measure_option for the formulas for the dissimilarity between variables (not presented here).

Gower requests the Gower dissimilarity coefficient for a mix of binary and continuous variables

sum(delta_ijv*d_ijv)/sum(delta_ijv)

where delta_ijv is a binary indicator equal to one whenever both observations i and j are nonmissing for variable v, and zero otherwise. Observations with missing values are not included when using cluster or mds, and so if an observation is included, delta_ijv = 1 and sum(delta_ijv) is the number of variables. However, using matrix dissimilarity with the Gower option does not exclude observations with missing values. See [MV] cluster, [MV] mds, and [MV] matrix dissimilarity.

For binary variables v,

d_ijv = 0   if x_iv = x_jv
      = 1   otherwise

This is the matching measure expressed as a dissimilarity: a match contributes 0 and a mismatch contributes 1.

For continuous variables v,

d_ijv = |x_iv - x_jv|/(max_k(x_kv)-min_k(x_kv))

d_ijv is set to 0 if (max_k(x_kv)-min_k(x_kv)) is zero, i.e., if the range of the variable is zero. This is the L1 measure divided by the range of the variable.

The Gower measure interprets binary variables as those with only 0, 1, or missing values. All other variables are treated as continuous.

In [MV] matrix dissimilarity, missing observations are included only in the calculation of the Gower dissimilarity, but the formula for this dissimilarity measure is undefined when all of the values of delta_ijv or delta_iuv are zero. The dissimilarity is then set to missing.
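The pieces of the Gower formula can be put together as follows. This is a minimal Python sketch, not Stata's implementation: missing values are represented by None, the function name and data are made up, and the range of each continuous variable is taken over all nonmissing values in the data, which is an assumption about which observations k enter max_k and min_k.

```python
def gower(data, i, j):
    """Gower dissimilarity between rows i and j of data (None = missing)."""
    numerator = 0.0
    n_compared = 0                        # sum of delta_ijv
    for v in range(len(data[0])):
        x_i, x_j = data[i][v], data[j][v]
        if x_i is None or x_j is None:
            continue                      # delta_ijv = 0
        column = [row[v] for row in data if row[v] is not None]
        if all(value in (0, 1) for value in column):
            d = 0.0 if x_i == x_j else 1.0          # binary variable
        else:
            value_range = max(column) - min(column)  # continuous variable
            d = abs(x_i - x_j) / value_range if value_range != 0 else 0.0
        numerator += d
        n_compared += 1
    if n_compared == 0:
        return None                       # undefined: set to missing
    return numerator / n_compared

# Two binary variables and one continuous variable with range 40 - 10 = 30.
data = [[1, 0, 10.0],
        [1, 1, 20.0],
        [0, 1, 40.0]]
print(gower(data, 0, 1))  # (0 + 1 + 10/30)/3
print(gower(data, 0, 2))  # (1 + 1 + 30/30)/3 = 1.0
```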

Technical note

Normally the commands

. matrix dissimilarity gm = x1 x2 y1, Gower
. clustermat waverage gm, add

and

. cluster waverage x1 x2 y1, measure(Gower)

will yield the same results, and likewise with mdsmat and mds. However, if any of the variables contain missing observations, this will not be the case. cluster and mds exclude all observations that have missing values for any of the variables of interest, whereas matrix dissimilarity with the Gower option does not. See [MV] cluster, [MV] mds, and [MV] matrix dissimilarity for more information.

Note: matrix dissimilarity without the Gower option does exclude all observations that have missing values for any of the variables of interest.

Also see

Manual: [MV] measure_option

Help: [MV] cluster, [P] matrix dissimilarity; [MV] parse_dissim

