Similarity coefficients for 2 x 2 binary data                     (STB-9: sg9)
---------------------------------------------

        ^similari^ # # # #


Description
-----------

Binary similarity measures estimate the proximity between two 1/0 binary
variables.  Three types of measures are provided: those that can be thought
of as similar to a correlation coefficient, those that can be interpreted as
conditional probabilities, and those that are predictability measures.

I have provided a program called ^similari^ which displays twelve such
similarity coefficients.  One need only enter the tabulated summary data on
the command line:

        ^similari^ A(0,0) B(0,1) C(1,0) D(1,1)

Hence, you may type in the cell counts directly from the output of the Stata
^tab^ command, reading left to right, row by row.

The following statistics are provided:

 1. ^Czekanowski (Dice)^: A matching coefficient in which double weight is
    given to (1,1) matches.

 2. ^Dispersion^: A similarity measure that ranges from -1 to 1.

 3. ^Jaccard^: A similarity ratio in which (0,0) matches are excluded from
    the equation.

 4. ^Match percent^: The ratio of total matches to the total population.

 5. ^Ochiai^: A similarity measure in cosine form.

 6. ^Phi 4-point^: A binary form of the Pearson product-moment correlation
    coefficient, ranging from -1 to 1.

 7. ^Russell & Rao^: A binary dot product; the ratio of (1,1) matches to the
    total population.

 8. ^Hamann^: A conditional probability measure ranging from -1 to 1.

 9. ^Anderberg's D^: A predictability measure indicating the reduction in
    the probability of error when one item is used to predict another.

10. ^Goodman and Kruskal's Lambda^: Indicates the proportional reduction in
    the probability of error when one item is used to predict another and
    the two directions of prediction are treated equally.  It measures the
    predictability of the value of one item given the value of the other.

11. ^Yule's Q^: A binary version of the Gamma test ranging from -1 to 1.

12. ^Yule's Y^: A coefficient of colligation ranging from -1 to 1.

All coefficients range from 0 to 1 unless otherwise indicated.

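As an illustration of the definitions above, the following Python sketch
computes ten of the twelve coefficients from the four cell counts.  It uses
the standard textbook formulas and is not taken from the ^similari^ source;
the function name and dictionary keys are hypothetical.  The two
predictability measures (Anderberg's D and Goodman and Kruskal's Lambda) are
left out to keep the sketch short.

```python
from math import sqrt


def similarity(n00, n01, n10, n11):
    """Textbook 2 x 2 similarity coefficients (illustrative sketch only).

    Arguments follow the command-line order A(0,0) B(0,1) C(1,0) D(1,1).
    Zero denominators are not guarded against.
    """
    n = n00 + n01 + n10 + n11            # total population
    matches = n00 + n11                  # (0,0) and (1,1) cells
    mismatches = n01 + n10               # (0,1) and (1,0) cells
    cross = n11 * n00 - n01 * n10        # cross-product difference
    row0, row1 = n00 + n01, n10 + n11    # row totals
    col0, col1 = n00 + n10, n01 + n11    # column totals
    root_match = sqrt(n11 * n00)
    root_mismatch = sqrt(n01 * n10)
    return {
        "Czekanowski": 2 * n11 / (2 * n11 + mismatches),
        "Dispersion": cross / n ** 2,
        "Jaccard": n11 / (n11 + mismatches),
        "Match %": matches / n,
        "Ochiai": n11 / sqrt(row1 * col1),
        "Phi 4-point": cross / sqrt(row0 * row1 * col0 * col1),
        "Russell & Rao": n11 / n,
        "Hamann": (matches - mismatches) / n,
        "Yules Q": cross / (n11 * n00 + n01 * n10),
        "Yules Y": (root_match - root_mismatch) / (root_match + root_mismatch),
    }


# Cell counts from the smoke-by-low table in the lbw data.
coefs = similarity(86, 29, 44, 30)
for name, value in coefs.items():
    print(f"{name:<14}= {value:.4f}")
```

With the counts 86 29 44 30 from the lbw data, these formulas agree to four
decimal places with the corresponding ten statistics reported by ^similari^.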
An example program run follows:

        . ^use lbw^
        . ^tab smoke low^

            smoked| birth weight<2500g
            during|
         pregnancy|         0          1 |     Total
        -----------+----------------------+----------
                 0 |        86         29 |       115
                 1 |        44         30 |        74
        -----------+----------------------+----------
             Total|       130         59 |       189

        . ^similari 86 29 44 30^

        Similarity coefficients for 2 X 2 binary data

               |     Controls         Cases     |
               |         0              1       |     Total
        -------+--------------------------------+----------
             0 |        86             29       |       115
             1 |        44             30       |        74
        -------+--------------------------------+----------
         Total |       130             59       |       189

        Proximity measures           Conditional probability measure
        Czekanowski   = 0.4511       Hamann               = 0.2275
        Dispersion    = 0.0365
        Jaccard       = 0.2913       Predictability measures
        Match %       = 0.6138       Anderbergs D         = 0.0026
        Ochiai        = 0.4540       G & K Lambda         = 0.0026
        Phi 4-point   = 0.1614       Yules Q              = 0.3382
        Russell & Rao = 0.1587       Yules Y (colligation)= 0.1742

**** Each listed statistic agrees with that produced by the Proximity command
     in SPSS for Windows.


References
----------

Anderberg, M. R. 1973.  Cluster Analysis for Applications.  New York:
    Academic Press.

Romesburg, H. C. 1984.  Cluster Analysis for Researchers.  Belmont, CA:
    Lifetime Learning Publications.

SPSS. 1991.  SPSS Statistical Algorithms.  2nd ed.  Chicago: SPSS.


Author
------

        Joseph Hilbe, Editor, STB
        Fax 602-860-1446