
From: David Airey <david.airey@vanderbilt.edu>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: Correlation coefficient between discrete and continuous variables

Date: Thu, 20 Nov 2008 14:52:07 -0600

***
clear
set obs 100
matrix C = (1, .5 \ .5, 1)
corr2data x y, n(100) corr(C) seed(1234)
corr y x
gen dummy = cond(x >.5,1,0)
corr y dummy
local t = abs(r(rho) / sqrt((1-r(rho)^2)/(r(N)-2)))
display "t = "`t'
local p = ttail(r(N)-2,`t') * 2
display "p = "`p'
ttest y, by(dummy)
regress y dummy
display sqrt(e(r2))
***

On Nov 20, 2008, at 12:08 PM, Sergiy Radyakin wrote:

Dear All,

a colleague of mine has just hinted to me that it may not be straightforward to compute a correlation coefficient when one of the variables is discrete. Until now I never cared, and neither does the Stata manual. In particular, it does not require anywhere that the variables be continuous, and the example shows the use of the -correlate- command to find a correlation between such discrete variables as -state- and -region- and such continuous variables as -marriage rate- and -divorce rate- (which is also strange, since there is no logical ordering of -state- and -region-, but that is a different issue).

After looking into the literature, the following paper seems to be the most relevant:

N.R. Cox, "Estimation of the Correlation between a Continuous and a Discrete Variable", Biometrics, Vol. 30, No. 1 (Mar., 1974), pp. 171-178
www.jstor.org/stable/2529626

In particular, my case satisfies the assumption made in the paper that the discrete value is derived from an underlying continuous variable (so there is an ordering: low, medium, or high). The method recommended in the paper seems very far from what Stata computes according to the manual; in particular, it calls for iterative maximum likelihood estimation.

Before I start writing any code myself, I would like to ask:

Q1: Does Stata make any adjustment to the way it computes the correlation coefficient based on the nature of the variable (discrete or continuous)?

Q2: Is the difference between the correlation coefficient as estimated by Stata in this case and the one computed in the recommended way practically important?

Q3: Is there any standard or user-written command to compute the correlation coefficient according to the method described in the paper above?

Q4: I am ultimately interested in the correlation between my observed continuous variable and the unobserved continuous variable that is represented by the discrete levels. Unfortunately, the thresholds are not available to me, so I cannot be sure about the size of the intervals. Furthermore, significant measurement error may be involved, since different interviewers may have eyeballed the continuous variable into groups differently. Should I instead focus on different measures of correlation? Could you please suggest any that better fit the context?

Thank you,
Sergiy Radyakin

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

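The Stata snippet in the reply was posted without commentary. It appears to illustrate that the ordinary Pearson correlation between a continuous variable and a 0/1 dummy carries the same information as a pooled two-sample t test or a regression on the dummy. A minimal sketch of that check follows, assuming the same simulated data as in the reply; the scalar names t_corr, t_ttest and t_reg are illustrative and not part of the original post.

```stata
* Sketch only: compare the t statistic obtained three ways.
clear
matrix C = (1, .5 \ .5, 1)
corr2data x y, n(100) corr(C) seed(1234)

* Dichotomize x at an arbitrary threshold, as in the reply
generate byte dummy = x > .5

* 1) t statistic implied by the Pearson correlation of y with the dummy
quietly correlate y dummy
scalar t_corr = abs(r(rho)) * sqrt((r(N) - 2) / (1 - r(rho)^2))

* 2) pooled (equal-variance) two-sample t test
quietly ttest y, by(dummy)
scalar t_ttest = abs(r(t))

* 3) t statistic on the dummy's coefficient in a regression
quietly regress y dummy
scalar t_reg = abs(_b[dummy] / _se[dummy])

display "t from correlation: " t_corr
display "t from ttest:       " t_ttest
display "t from regression:  " t_reg

* The three values should agree up to rounding error
assert reldif(t_corr, t_ttest) < 1e-6
assert reldif(t_corr, t_reg)   < 1e-6
```

This only shows what -correlate- computes when one variable is binary. It does not estimate the correlation with the underlying unobserved continuous variable that Q4 asks about.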

**References**:
- **st: Correlation coefficient between discrete and continuous variables**, *From:* "Sergiy Radyakin" <serjradyakin@gmail.com>
