Re: st: Correlation coefficient between discrete and continuous variables

From: David Airey
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Correlation coefficient between discrete and continuous variables
Date: Thu, 20 Nov 2008 14:52:07 -0600

A point-biserial correlation is equivalent to a two-sample t test, which is equivalent to a regression on a dummy variable:
```
***
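clear
set obs 100
* Target correlation matrix: corr(x, y) = .5
matrix C = (1, .5 \ .5, 1)
corr2data x y, n(100) corr(C) seed(1234)
corr y x
* Dichotomize x to get a 0/1 variable
gen dummy = cond(x > .5, 1, 0)
* Point-biserial correlation: Pearson r between y and the dummy
corr y dummy
* t statistic and two-sided p-value implied by r, with N-2 df
local t = abs(r(rho) / sqrt((1-r(rho)^2)/(r(N)-2)))
display "t = " `t'
local p = ttail(r(N)-2,`t') * 2
display "p = " `p'
* Same |t| and p from a two-sample t test (equal variances)
ttest y, by(dummy)
* Same again from the slope in a regression on the dummy
regress y dummy
* sqrt(R-squared) equals |r| from -corr y dummy-
display sqrt(e(r2))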
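* Added sketch, not part of the original post: the t statistic on the
* slope of dummy should equal the t computed from the correlation above
* (and the -ttest- statistic up to sign). Assumes -regress y dummy- was
* the last estimation command run.
display "t (regress) = " abs(_b[dummy]/_se[dummy])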
***
```

On Nov 20, 2008, at 12:08 PM, Sergiy Radyakin wrote:

```
Dear All,

a colleague of mine has just hinted to me that it may not be
straightforward to compute a correlation coefficient when one of the
variables is discrete. Until now I never cared, and neither does the
Stata manual. In particular, it nowhere requires the variables to be
continuous, and the example shows the use of the -correlate- command
to find a correlation between such discrete variables as -state- and
-region- and such continuous variables as -marriage rate-, -divorce
rate- (which is also strange since there is no logical ordering of
-state- and -region-, but that is a different issue).

After looking into the literature, the following paper seems to be
most relevant:

N. R. Cox, "Estimation of the Correlation between a Continuous and a
Discrete Variable", Biometrics, Vol.30, No.1 (Mar., 1974), pp. 171-178
www.jstor.org/stable/2529626

In particular, my case satisfies the assumption made in the paper that
the discrete variable is derived from an underlying continuous variable
(so there is an ordering: low, medium, or high). The method recommended
in the paper seems very different from what Stata computes according
to the manual; in particular, it calls for iterative maximum
likelihood estimation.

Before I start writing any code myself, I would like to ask:

Q1: does Stata do any adjustment to the way it computes the
correlation coefficient based on the nature of the variable (discrete
or continuous)?

Q2: is the difference between (the correlation coefficient as
estimated by Stata in this case) and (the one computed by the
recommended way) practically important?

Q3: is there any standard or user-written command to compute the
correlation coefficient according to the method described in the paper
above?

Q4: I am ultimately interested in the correlation between my observed
continuous variable and the unobserved continuous variable, which is
represented in the discrete levels. Unfortunately the thresholds are
not available to me, so I may not be sure about the size of the
intervals. Furthermore, a significant measurement error may be
involved, since many interviewers may have eyeballed the continuous
variable into different groups differently. Should I instead focus on
other measures of correlation? Could you please suggest any that
better fit the context?

Thank you,
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```