The Stata listserver

RE: RE: st: RE: Econometrics Theory Questions on Dummies and Correlation Analysis

From   Joseph Coveney
To   Statalist
Subject   RE: RE: st: RE: Econometrics Theory Questions on Dummies and Correlation Analysis
Date   Tue, 19 Apr 2005 18:05:54 +0900

SamL wrote:

I did not (mean to) indicate that the variance of a binary variable is
undefined or meaningless.  The variance of a binary variable just doesn't
tell you anything you don't know by knowing the mean, i.e., it is

Also, yes, I know there are lots of ways to think of the correlation
coefficient.  I indicated in my note that I was talking about one way.

To be constructive, let me first be honest--I haven't studied many of the
other ways.  It strikes me that at least one of those other ways might
provide grounds for strong support for reporting Pearson corr coeffs with
binary variables.  But I am at the limit of my knowledge.  Anyone know a
way to derive and defend the Pearson's correlation coeff for binary
variables as an unbiased indicator of association?  I'd love to read it
here or be pointed to a citation--it would help me in my own work and in
my teaching.


I'm not sure that I follow what "unbiased indicator of association" means in
this context.  My understanding is that Pearson's correlation coefficient
can be used with binary variables without any apology or defense.  It's a
correlation coefficient by virtue of the arithmetic alone.  (This was one of
Nick Cox's points, I thought.)  If you would like to see an instance of it
used as a correlation coefficient for binary variables (if I recall
correctly), take a look at A. D. Lunn and S. J. Davies, A note on generating
correlated binary variables. _Biometrika_ 85(2):487-490, 1998.
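
To make the "arithmetic alone" point concrete, here is a sketch of what
Pearson's formula reduces to in a fourfold table.  The cell labels a, b, c, d
and the total n are notation introduced here for illustration (they do not
appear in the thread): a and d count the cells (0,0) and (1,1), and b and c
count the cells (0,1) and (1,0).

```latex
r_{xy}
  = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}
  = \phi ,
\qquad
\chi^2 = n\,\phi^2
\;\Longrightarrow\;
|\phi| = \sqrt{\chi^2 / n} .
```

For a 2 x 2 table, sqrt(chi2/n) is Cramer's V (up to sign), and Kendall's
tau-b reduces to the same expression, which is why these measures coincide
for a pair of binary variables.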

In any event, the do-file below might be helpful in the classroom or
computer laboratory in illustrating some of the various measures of
correlation, association or agreement applicable to binary variables that
are available in Stata.  I naively think of them as falling into two groups:
those that measure correlation between the binary variables and those
that estimate correlation of the latent variables that underlie (or that
can be tacitly conceived as underlying) the binary variables.  The
do-file illustrates that, in the context of a fourfold table, many familiar
coefficients and indexes of association turn out to represent the former
whether by coincidence or equivalence.  The former could also in some broad
sense encompass concordance measures like Goodman and Kruskal's gamma /
Yule's Q.  In the do-file below, Pearson's correlation coefficient of the
binary variables is returned as rho_manvar (rho of the manifest variables,
as opposed to that of the latent variables, which is rho_latvar).  Several
of the measures illustrated are from user-written commands, for
example, -somersd-, -polychoric-, -tetrac- (an approximation of
what -polychoric- gives in this case) and -reoprob-.  You'll need to have
these (as well as -slist-) installed.
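
As a rough guide to how the two groups relate in the setting the do-file
simulates: when the latent variables are standard bivariate normal and each
is dichotomized at zero, a classical result (Sheppard's formula) gives the
correlation of the resulting binary variables in closed form.  The symbols
below are notation introduced here, with Z_0 and Z_1 the latent variables and
rho_lat and rho_man standing for the population counterparts of rho_latvar
and rho_manvar.

```latex
\Pr(Z_0 > 0,\; Z_1 > 0) = \frac{1}{4} + \frac{\arcsin \rho_{\mathrm{lat}}}{2\pi},
\qquad
\rho_{\mathrm{man}}
  = \frac{\Pr(Z_0 > 0,\; Z_1 > 0) - \tfrac{1}{4}}{\tfrac{1}{4}}
  = \frac{2}{\pi}\,\arcsin \rho_{\mathrm{lat}} .
```

For example, rho_lat = 0.5 gives rho_man = 1/3, so the manifest correlation
is attenuated relative to the latent one.  With cut points other than zero
the relationship is different.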

Joseph Coveney

set more off
set seed `=date("2005-04-18", "ymd")'
program define corbingen, rclass
    version 8.2
    drawnorm mu0 mu1, corr(1 `1' \ `1' 1) n(200) clear
    correlate mu0 mu1
    return scalar rho_latvar = r(rho)
    replace mu0 = mu0 > 0
    replace mu1 = mu1 > 0
    tabulate mu*, all
    return scalar gamma = r(gamma)
    return scalar CramersV = r(CramersV)
    return scalar taub = r(taub)
    somersd mu0 mu1
    matrix A = e(b)
    return scalar somersd = A[1,1]
    correlate mu*
    return scalar rho_manvar = r(rho)
    kap mu0 mu1
    return scalar kappa = r(kappa)
    dprobit mu0 mu1
    matrix A = e(dfdx)
    return scalar dFdx = A[1,1]
    tetrac mu0 mu1
    return scalar tetrac = r(tetra)
    polychoric mu0 mu1
    return scalar polychoric = r(rho)
    generate int rec = _n
    reshape long mu, i(rec) j(tim)
    reoprob mu tim, i(rec)
    matrix A = e(b)
    return scalar reprobit_rho = A[1,3]
end
forvalues rho = 0.1(0.1)0.9 {
    local R = round(10 * `rho', 1)
    tempfile sim`R'
    simulate "corbingen `rho'" rho_latvar = r(rho_latvar) ///
      gamma = r(gamma) tetrac = r(tetrac) ///
      polychoric = r(polychoric) ///
      reprobit_rho = r(reprobit_rho) ///
      rho_manvar = r(rho_manvar) CramersV = r(CramersV) taub = r(taub) ///
      somersd = r(somersd) dFdx = r(dFdx) ///
      kappa = r(kappa), reps(1) saving(`sim`R'')
}
forvalues R = 1/8 {
    append using `sim`R''
}
sort rho_latvar
slist, noobs decimal(3)
assert rho_manvar == taub
assert rho_manvar == CramersV
pause on
graph7 kappa dFdx gamma rho_manvar rho_manvar, ///
  xlabel ylabel connect(..LL) symbol(oTii)
pause
graph7 gamma reprobit_rho polychoric rho_latvar ///
  rho_latvar, xlabel ylabel connect(...L) symbol(oTxi)
