# st: RE: Correlation coefficient between discrete and continuous variables

From: "Nick Cox"
Subject: st: RE: Correlation coefficient between discrete and continuous variables
Date: Thu, 20 Nov 2008 18:53:14 -0000

As so often with questions, I find I'd rather suggest recasting the
basis on which something is being presented.

Stata clearly has an idea of data type, but as you know well that is
essentially a matter of how variables are stored. Stata itself has no
idea of discrete or continuous; that is largely in the mind of the
beholder and often at least partly a matter of convention or convenience
in data collection, data analysis or both.

Thus income is a continuous variable in principle in terms of the theory
that people find it congenial to write; a discrete variable in principle
in so far as no currency is indefinitely divisible; and sometimes a
highly discrete variable in practice in so far as it arrives in the
researcher's datasets in a rounded, coarsened, or binned manner.

Conversely, many people feel free to take grades assigned to students in
courses, forget their original discreteness, and correlate grade point
averages computed to several decimals with other grade point averages
and other stuff with complete impunity. There would be a large hole in
social science research if everybody took measurement theory seriously,
as its proponents no doubt feel keenly.

Stata will -correlate- anything and everything numeric fed to it without
discrimination and without varying what it does. Both variables can be
discrete -- even binary -- and a correlation can still make some sense;
for highly discrete data it is unlikely to be the best thing to
calculate, but that is a different story. Conversely, both variables can
be continuous in every sense but the finiteness of computer
representation, yet a correlation can nevertheless be a silly or pointless
thing to calculate, because the relationship is highly nonlinear or even
non-monotonic. Stata basically leaves the user to decide, or to feel
free to do anything however stupid or sensible. The alternative is a
nanny program that encodes people's prejudices on which methods are
appropriate when. If you are a software writer and try that, all that
becomes apparent very quickly is that your prejudices don't match other
people's.
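
Both cases can be sketched in a few lines. What follows is only an illustration in Python, not Stata's code: the `pearson` helper and the toy data are made up for the purpose. The same product-moment formula is applied once to two binary variables and once to a continuous but non-monotonic relationship.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Two binary variables (hypothetical toy data): the formula applies unchanged.
alive = [1, 1, 1, 0, 0, 1, 0, 0]
treated = [1, 1, 0, 0, 0, 1, 1, 0]
r_binary = pearson(alive, treated)

# A continuous but non-monotonic relationship: y is fully determined by x,
# yet the linear correlation is zero because the parabola is symmetric.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v ** 2 for v in x]
r_parabola = pearson(x, y)

print(r_binary, r_parabola)
```

The binary case gives a perfectly interpretable number, while the parabola gives a correlation of zero despite an exact relationship. Neither outcome depends on how the variables are stored.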

In my book, correlation does not really require continuity of the data,
just that in principle linear predictability makes sense in some way,
almost regardless of how the data were produced or are presented. Indeed
Spearman correlation for example makes perfect sense as the correlation
between ranks, which themselves are _defined_ as discrete.
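
As a minimal sketch of that point, again in Python with made-up helper names (`midranks`, `pearson`, `spearman`) and toy data, Spearman correlation can be computed by ranking each variable, using midranks for ties, and then applying the ordinary product-moment formula to the ranks.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def midranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(midranks(x), midranks(y))

# Hypothetical data: monotone but nonlinear, so Spearman is 1 and Pearson is not.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
tied = midranks([10, 20, 20, 30])

print(r_pearson, r_spearman, tied)
```

The ranks are discrete by construction, and nothing in the calculation objects.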

Your question raises issues on which reasonable statistically-minded
people have been disagreeing, sometimes vehemently, for at least a
century. Pearson and Yule fell out over whether discrete variables (even
being alive or dead) were really better thought of as approximations to
underlying or latent continuous variables. Continued enthusiasm in many
places for factor analysis or projection pursuit with discrete data
shows that the debate is just as alive in 2008 as it was in 1908.

On your specifics, I don't recollect an implementation of N.R. Cox's
method. That wasn't me, just in case somebody wonders. Try -findit- or

Nick
n.j.cox@durham.ac.uk

A colleague of mine has just hinted to me that it may not be
straightforward to compute a correlation coefficient when one of the
variables is discrete. Until now I never cared, and neither does the
Stata manual. In particular, nowhere does it require the variables
to be continuous, and its example uses the -correlate- command
to find correlations between such discrete variables as -state- and
-region- and such continuous variables as -marriage rate- and -divorce
rate- (which is also strange, since there is no logical ordering of
-state- and -region-, but that is a different issue).

After looking into the literature, the following paper seems to be
most relevant:

N.R. Cox, "Estimation of the Correlation between a Continuous and a
Discrete Variable", Biometrics, Vol. 30, No. 1 (Mar., 1974), pp. 171-178
www.jstor.org/stable/2529626

In particular, my case satisfies the assumption made in the paper that
the discrete value is derived from an underlying continuous variable
(so there is an ordering: low, medium, or high). The method recommended
in the paper seems very far from what Stata computes according to the
manual; in particular, it calls for iterative maximum likelihood
estimation.
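
To get a rough feel for the size of the gap, here is a small simulation sketch in Python. It is not the maximum likelihood method of the paper. It only compares the ordinary product-moment correlation computed with a latent continuous variable against the same correlation computed with a three-level coarsening of it. Everything specific is an assumption made for illustration: bivariate normal data, a latent correlation of 0.6, cut points at -0.43 and 0.43 (roughly terciles), and the helper names `pearson` and `coarsen`.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def coarsen(v, low=-0.43, high=0.43):
    """Map a latent value to 0 (low), 1 (medium) or 2 (high); cut points assumed."""
    if v < low:
        return 0
    return 1 if v < high else 2

random.seed(2008)  # fixed seed so the run is reproducible
n = 20000
rho = 0.6          # assumed correlation between x and the latent variable z

x, z = [], []
for _ in range(n):
    u, e = random.gauss(0, 1), random.gauss(0, 1)
    x.append(u)
    z.append(rho * u + sqrt(1 - rho ** 2) * e)  # corr(x, z) is rho by construction

d = [coarsen(v) for v in z]   # what would be observed: low / medium / high

r_latent = pearson(x, z)      # correlation with the unobserved continuous variable
r_coarse = pearson(x, d)      # correlation with its three-level coarsening

print(round(r_latent, 3), round(r_coarse, 3))
```

Under these assumptions the coarsened correlation comes out somewhat smaller in magnitude than the latent one. With unknown thresholds and interviewer error, the attenuation in real data could easily differ.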

Before I start writing any code myself, I would like to ask:

Q1: does Stata do any adjustment to the way it computes the
correlation coefficient based on the nature of the variable (discrete
or continuous)?

Q2: is the difference between (the correlation coefficient as
estimated by Stata in this case) and (the one computed by the
recommended way) practically important?

Q3: is there any standard or user-written command to compute the
correlation coefficient according to the method described in the paper
above?

Q4: I am ultimately interested in the correlation between my observed
continuous variable and the unobserved continuous variable, which is
represented in the discrete levels. Unfortunately the thresholds are
not available to me, so I may not be sure about the size of the
intervals. Furthermore, a significant measurement error may be
involved, since many interviewers may have eyeballed the continuous
variable into different groups differently. Should I instead focus on
different measures of correlation? Could you please suggest any
that better fit the context?
