
st: RE: Correlation coefficient between discrete and continuous variables

From   "Lachenbruch, Peter" <>
To   <>
Subject   st: RE: Correlation coefficient between discrete and continuous variables
Date   Thu, 20 Nov 2008 12:09:53 -0800

When the discrete variable is a dichotomy, the test that the correlation
is zero is the same as the equal-variance two-sample t-test with the
dichotomy as the 'by' variable.  I've never looked at the case for
discrete variables with k categories.  A little algebra could be fun and
illuminating...  (or not)
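
The dichotomy case can be checked numerically. The sketch below is only an illustration: the data are made up, the helper names (pearson_r, t_from_r, pooled_t) are hypothetical, and it assumes the t-test in question is the pooled (equal-variance) one. It uses plain Python rather than Stata. It computes t from the correlation as r*sqrt((n-2)/(1-r^2)) and compares it with the pooled two-sample t statistic.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson_r(x, y):
    # Ordinary product-moment correlation, no adjustment for variable type.
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    # t statistic for H0: correlation = 0, with n - 2 degrees of freedom.
    return r * math.sqrt((n - 2) / (1 - r * r))

def pooled_t(y0, y1):
    # Two-sample t statistic with a pooled variance estimate.
    n0, n1 = len(y0), len(y1)
    m0, m1 = mean(y0), mean(y1)
    ss0 = sum((v - m0) ** 2 for v in y0)
    ss1 = sum((v - m1) ** 2 for v in y1)
    sp2 = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(sp2 * (1 / n0 + 1 / n1))

# Made-up data: a 0/1 'by' variable and a continuous outcome.
group = [0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [4.1, 5.0, 3.8, 4.6, 5.9, 6.3, 5.1, 6.8, 5.5]

r = pearson_r(group, y)
t_corr = t_from_r(r, len(y))
t_two = pooled_t([v for g, v in zip(group, y) if g == 0],
                 [v for g, v in zip(group, y) if g == 1])

print("t from correlation:", t_corr)
print("pooled two-sample t:", t_two)
```

The two printed values should agree up to rounding error. With the unequal-variance (Welch) version of the t-test the agreement would generally not hold.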


Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
[] On Behalf Of Sergiy
Sent: Thursday, November 20, 2008 10:08 AM
Subject: st: Correlation coefficient between discrete and continuous

Dear All,

A colleague of mine has just hinted to me that it may not be
straightforward to compute a correlation coefficient when one of the
variables is discrete. Until now I never cared, and neither does the
Stata manual. In particular, it does not require anywhere that the
variables be continuous, and the example shows the use of the
-correlate- command to find a correlation between such discrete
variables as -state- and -region- and such continuous variables as
-marriage rate- and -divorce rate- (which is also strange, since there
is no logical ordering of -state- and -region-, but that is a different
issue).

After looking into the literature, the following paper seems to be
most relevant:

   N.R.Cox "Estimation of the Correlation between a Continuous and a
Discrete Variable", Biometrics, Vol.30, No.1 (Mar., 1974), pp. 171-178

In particular, my case satisfies the assumption made in the paper that
the discrete value is derived from an underlying continuous variable
(so there is an ordering: low, medium, or high). The method recommended
in the paper seems very far from what Stata computes according to the
manual; in particular, it calls for iterative maximum likelihood
estimation.

Before I start writing any code myself, I would like to ask:

Q1: does Stata do any adjustment to the way it computes the
correlation coefficient based on the nature of the variable (discrete
or continuous)?

Q2: is the difference between (the correlation coefficient as
estimated by Stata in this case) and (the one computed by the
recommended way) practically important?

Q3: is there any standard or user-written command to compute the
correlation coefficient according to the method described in the paper?

Q4: I am ultimately interested in the correlation between my observed
continuous variable and the unobserved continuous variable, which is
represented by the discrete levels. Unfortunately, the thresholds are
not available to me, so I cannot be sure about the size of the
intervals. Furthermore, significant measurement error may be
involved, since many interviewers may have eyeballed the continuous
variable into the groups differently. Should I instead focus on
different measures of correlation? Could you please suggest any
that better fit the context?
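
A rough way to get a feel for Q2 is a small simulation. The sketch below is only an illustration under loudly stated assumptions: the latent and observed variables are bivariate normal with a latent correlation of 0.6, the thresholds -0.5 and 0.5 are hypothetical (the real ones are unknown), and there is no interviewer error. It uses plain Python rather than Stata and does not implement the maximum likelihood method from the paper. It only compares the ordinary correlation computed on the latent variable with the one computed on the three-level version.

```python
import math
import random

def pearson_r(x, y):
    # Ordinary product-moment correlation.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def to_level(z, low_cut=-0.5, high_cut=0.5):
    # Hypothetical thresholds: 1 = low, 2 = medium, 3 = high.
    if z < low_cut:
        return 1
    if z < high_cut:
        return 2
    return 3

random.seed(12345)   # fixed seed so the run is reproducible
rho = 0.6            # assumed latent correlation
n = 20000

latent, observed = [], []
for _ in range(n):
    z = random.gauss(0, 1)
    e = random.gauss(0, 1)
    latent.append(z)
    # Construct the observed variable so that corr(latent, observed) = rho.
    observed.append(rho * z + math.sqrt(1 - rho ** 2) * e)

levels = [to_level(z) for z in latent]

r_latent = pearson_r(latent, observed)
r_levels = pearson_r(levels, observed)

print("correlation using the latent variable:", r_latent)
print("correlation using the three levels:   ", r_levels)
```

Under these assumptions the correlation based on the three levels comes out smaller in magnitude than the one based on the latent variable. How much smaller depends on where the thresholds fall and on how consistently the grouping was done, both of which are unknown here.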

Thank you,
   Sergiy Radyakin