Re: st: Correlation coefficient between discrete and continuous variables

From: "Austin Nichols"
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Correlation coefficient between discrete and continuous variables
Date: Thu, 20 Nov 2008 13:38:19 -0500

```Sergiy--
Might the -oprobit- command do what you want?

Maybe someone with more ordered probit expertise can comment on this example:

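sysuse auto, clear
* standardize both variables, creating c_rep78 and c_price
center rep78 price, c s
* with both variables standardized, the OLS slope equals the Pearson correlation
qui reg c_*
di _b[c_price]
qui corr rep78 price
di r(rho)
* ordered probit on the standardized, then on the original, outcome
oprobit c_rep78 c_price
oprobit rep78 c_price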

(-center- is from SSC).
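
* A possible follow-up, offered as a sketch rather than a definitive method:
* under the ordered probit latent model y* = b*x + e with e ~ N(0,1), and
* x standardized to unit variance, the implied correlation between x and
* the latent y* is b/sqrt(1 + b^2). This assumes the latent-normal model
* is appropriate for the data.
qui oprobit rep78 c_price
di _b[c_price]/sqrt(1 + _b[c_price]^2)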

> Dear All,
>
> a colleague of mine has just hinted to me that it may not be
> straightforward to compute a correlation coefficient when one of the
> variables is discrete. Until now I never cared, and neither, it seems,
> does the Stata manual. In particular, it nowhere requires the variables
> to be continuous, and its example uses the -correlate- command
> to find correlations between such discrete variables as -state- and
> -region- and such continuous variables as -marriage rate- and -divorce
> rate- (which is also strange, since there is no logical ordering of
> -state- and -region-, but that is a different issue).
>
> After looking into the literature, the following paper seems to be
> most relevant:
>
>   N. R. Cox, "Estimation of the Correlation between a Continuous and a
> Discrete Variable", Biometrics, Vol.30, No.1 (Mar., 1974), pp. 171-178
>   www.jstor.org/stable/2529626
>
> In particular, my case satisfies the paper's assumption that
> the discrete variable is derived from an underlying continuous variable
> (so there is an ordering: low, medium, or high). The method recommended
> in the paper seems very far from what Stata computes
> according to the manual; notably, it calls for iterative maximum
> likelihood estimation.
>
> Before I start writing any code myself, I would like to ask:
>
> Q1: does Stata make any adjustment to the way it computes the
> correlation coefficient based on the nature of the variable (discrete
> or continuous)?
>
> Q2: is the difference between (the correlation coefficient as
> estimated by Stata in this case) and (the one computed by the
> recommended way) practically important?
>
> Q3: is there any standard or user-written command to compute the
> correlation coefficient according to the method described in the paper
> above?
>
> Q4: I am ultimately interested in the correlation between my observed
> continuous variable and the unobserved continuous variable that is
> represented by the discrete levels. Unfortunately, the thresholds are
> not available to me, so I cannot be sure about the size of the
> intervals. Furthermore, significant measurement error may be
> involved, since different interviewers may have eyeballed the continuous
> variable into groups differently. Should I instead focus on
> other measures of correlation? Could you please suggest any
> that better fit the context?
>
> Thank you,