# Re: st: Correlation coefficient between discrete and continuous variables

From: "Sergiy Radyakin"
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Correlation coefficient between discrete and continuous variables
Date: Thu, 20 Nov 2008 15:23:19 -0500

```Dear Austin, thank you for having a look at it. My feeling is that
this is the nature of the problem and the correlation coefficients
estimated in the two ways will differ similarly to how the
coefficients of a linear probability model differ from coefficients in
discrete outcome models. And in the case of unequal-width intervals I'd
give preference to the probit.

In your example the difference in estimated rho is 0.00655327 versus
0.0077479, if I got it right, which is something I would not spend too
much time on. But have a look at this example:

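sysuse auto, clear
* bin weight into five groups with unequal-width intervals
egen weight_group=cut(weight), at(0,2000,2500,3000,4000,5000)
* -center- is from SSC; it creates c_weight_group and c_price
center weight_group price, c s
* linear regression on the c_* variables
qui reg c_*
di _b[c_price]
* Pearson correlation on the original variables
qui corr weight_group price
di r(rho)
* ordered probit, with and without the transformed outcome
oprobit c_weight_group c_price
oprobit weight_group c_price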

Here the difference is 0.40929567 versus 0.6149637 and this matters for my case.

Thank you, Steven, for pointing to the -polychoric- command (which in the
case above results in 0.51170656). Unless there is a critique of it, I
will be using it for my computations.

Best regards,

On Thu, Nov 20, 2008 at 1:38 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Sergiy--
> Might the -oprobit- command do what you want?
>
> Maybe someone with more ordered probit expertise can comment on this example:
>
> sysuse auto, clear
> center rep78 price, c s
> qui reg c_*
> di _b[c_price]
> qui corr rep78 price
> di r(rho)
> oprobit c_rep78 c_price
> oprobit rep78 c_price
>
> (-center- is from SSC).
>
>> Dear All,
>>
>> a colleague of mine has just hinted to me that it may not be
>> straightforward to compute a correlation coefficient when one of the
>> variables is discrete. Until now I never cared, and neither does the
>> Stata manual. In particular, it does not require anywhere that the
>> variables be continuous, and the example shows the use of the -correlate- command
>> to find a correlation between such discrete variables as -state- and
>> -region- and such continuous variables as -marriage rate-, -divorce
>> rate- (which is also strange since there is no logical ordering of
>> -state- and -region-, but that is a different issue).
>>
>> After looking into the literature, the following paper seems to be
>> most relevant:
>>
>>   N. R. Cox, "Estimation of the Correlation between a Continuous and a
>> Discrete Variable", Biometrics, Vol. 30, No. 1 (Mar., 1974), pp. 171-178
>>   www.jstor.org/stable/2529626
>>
>> In particular my case satisfies the assumptions made in the paper that
>> the discrete value is derived from an underlying continuous variable
>> (so there is ordering: low, medium, or high). The method recommended
>> in the paper seems very far from what Stata computes
>> according to the manual; in particular, it calls for iterative maximum
>> likelihood estimation.
>>
>> Before I start writing any code myself, I would like to ask:
>>
>> Q1: does Stata do any adjustment to the way it computes the
>> correlation coefficient based on the nature of the variable (discrete
>> or continuous)?
>>
>> Q2: is the difference between (the correlation coefficient as
>> estimated by Stata in this case) and (the one computed by the
>> recommended way) practically important?
>>
>> Q3: is there any standard or user-written command to compute the
>> correlation coefficient according to the method described in the paper
>> above?
>>
>> Q4: I am ultimately interested in the correlation between my observed
>> continuous variable and the unobserved continuous variable, which is
>> represented in the discrete levels. Unfortunately the thresholds are
>> not available to me, so I may not be sure about the size of the
>> intervals. Furthermore, a significant measurement error may be
>> involved, since many interviewers may have eyeballed the continuous
>> variable into different groups differently. Should I instead focus on
>> different measures of correlation? Could you please suggest any
>> that better fit the context?
>>
>> Thank you,
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
```
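The gap discussed in this thread, between a Pearson correlation computed on group codes and the correlation with the underlying continuous variable, can be illustrated on simulated data where the true value is known. The sketch below is neither the maximum likelihood estimator from the Cox (1974) paper nor what -polychoric- computes. It is a simpler two-step moment correction that assumes bivariate normality and consecutive integer group codes: under those assumptions Corr(x, y) = rho * sum_j phi(tau_j) / sd(y), where tau_j are the latent thresholds and phi is the standard normal density. The seed, the true rho of 0.6, the three thresholds, and all variable and scalar names are assumptions made for illustration.

```stata
* Sketch only: simulated data, hypothetical thresholds and names.
clear
set seed 20081120
set obs 20000
local rho = 0.6

* observed continuous variable and latent continuous variable, corr = rho
generate double x   = rnormal()
generate double eta = `rho'*x + sqrt(1 - `rho'^2)*rnormal()

* observed discrete variable: latent cut at unequal-width thresholds,
* coded as consecutive integers 1..4
generate byte y = 1 + (eta > -1) + (eta > -0.5) + (eta > 1.2)

* step 0: plain Pearson correlation on the group codes
quietly correlate x y
scalar r_pearson = r(rho)

* step 1: population standard deviation of the codes
quietly summarize y
scalar n_obs = r(N)
scalar sd_y  = sqrt(r(Var)*(r(N) - 1)/r(N))

* step 2: thresholds from cumulative proportions, then sum of densities
scalar sum_phi = 0
forvalues j = 1/3 {
    quietly count if y <= `j'
    scalar sum_phi = sum_phi + normalden(invnormal(r(N)/n_obs))
}

* adjusted estimate of the latent correlation
scalar r_adj = r_pearson*sd_y/sum_phi
display "Pearson on group codes: " r_pearson
display "Two-step adjusted:      " r_adj
```

With these settings the Pearson value should come out attenuated relative to the 0.6 used in the simulation, while the adjusted value should land close to it. On real data such as the -auto- example, where normality is doubtful, a likelihood-based command like -polychoric- is probably the safer choice.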