
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: Correlation coefficient between discrete and continuous variables
Date: Thu, 20 Nov 2008 18:53:14 -0000

As so often with questionnaires I find I'd rather suggest recasting the basis on which something is being presented.

Stata clearly has an idea of data type, but as you know well that is essentially a matter of how variables are stored. Stata itself has no idea of discrete or continuous; that is largely in the mind of the beholder and at least partly often a matter of convention or convenience in data collection, data analysis or both. Thus income is a continuous variable in principle in terms of the theory that people find it congenial to write; a discrete variable in principle in so far as no currency is indefinitely divisible; and sometimes a highly discrete variable in practice in so far as it arrives in the researcher's datasets in a rounded, coarsened, or binned manner. Conversely, many people feel free to take grades assigned to students in courses, forget their original discreteness, and correlate grade point averages computed to several decimals with other grade point averages and other stuff with complete impunity. There would be a large hole in social science research if everybody took measurement theory seriously, as its proponents no doubt feel keenly.

Stata will -correlate- anything and everything numeric fed to it without discrimination and without varying what it does. Both variables can be discrete -- even binary -- and a correlation can still make some sense; for highly discrete data it is unlikely to be the best thing to calculate, but that is a different story. Conversely, both variables can be continuous in every sense but the finiteness of computer representation, yet a correlation can nevertheless be a silly or pointless thing to calculate, because the relationship is highly nonlinear or even non-monotonic.

Stata basically leaves the user to decide, or to feel free to do anything however stupid or sensible. The alternative is a nanny program that encodes people's prejudices on which methods are appropriate when.
If you are a software writer and try that, all that becomes apparent very quickly is that your prejudices don't match other people's.

In my book, correlation does not really require continuity of the data, just that in principle linear predictability makes sense in some way, almost regardless of how the data were produced or are presented. Indeed Spearman correlation, for example, makes perfect sense as the correlation between ranks, which themselves are _defined_ as discrete.

Your question raises issues on which reasonable statistically-minded people have been disagreeing, sometimes vehemently, for at least a century. Pearson and Yule fell out over whether discrete variables (even being alive or dead) were really better thought of as approximations to underlying or latent continuous variables. Continued enthusiasm in many places for factor analysis or projection pursuit with discrete data shows that the debate is just as alive in 2008 as it was in 1908.

On your specifics, I don't recollect an implementation of N.R. Cox's method. That wasn't me, just in case somebody wonders. Try -findit- or Google as usual.

Nick
n.j.cox@durham.ac.uk

Sergiy Radyakin

A colleague of mine has just hinted to me that it may not be straightforward to compute a correlation coefficient when one of the variables is discrete. Until now I never cared, and neither does the Stata manual. In particular, nowhere does it require the variables to be continuous, and the example shows the use of the -correlate- command to find a correlation between such discrete variables as -state- and -region- and such continuous variables as -marriage rate- and -divorce rate- (which is also strange since there is no logical ordering of -state- and -region-, but that is a different issue).

After looking into the literature, the following paper seems to be most relevant:

N.R.Cox "Estimation of the Correlation between a Continuous and a Discrete Variable", Biometrics, Vol.30, No.1 (Mar., 1974), pp. 171-178
www.jstor.org/stable/2529626

In particular my case satisfies the assumption made in the paper that the discrete value is derived from an underlying continuous variable (so there is an ordering: low, medium, or high). The approach recommended in the paper seems very far from what Stata computes according to the manual; in particular, it calls for iterative maximum likelihood estimation.

Before I start writing any code myself, I would like to ask:

Q1: Does Stata make any adjustment to the way it computes the correlation coefficient based on the nature of the variable (discrete or continuous)?

Q2: Is the difference between (the correlation coefficient as estimated by Stata in this case) and (the one computed in the recommended way) practically important?

Q3: Is there any standard or user-written command to compute the correlation coefficient according to the method described in the paper above?

Q4: I am ultimately interested in the correlation between my observed continuous variable and the unobserved continuous variable, which is represented by the discrete levels. Unfortunately the thresholds are not available to me, so I cannot be sure about the size of the intervals. Furthermore, significant measurement error may be involved, since different interviewers may have eyeballed the continuous variable into groups differently. Should I instead focus on different measures of correlation? Could you please suggest any that better fit the context?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
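
Illustrative note. The points in the thread about Q1 and Q2 can be checked with a small simulation. What follows is a minimal sketch in Python rather than Stata, using only the standard library. It is not N.R. Cox's maximum likelihood method. The function names (`pearson`, `midranks`, `spearman`, `simulate`), the latent correlation of 0.6, the cut points at -0.5 and 0.5, and the sample size are all assumptions made for illustration. It shows three things: a plain product-moment correlation applies the same formula whatever the variables look like; coarsening a latent continuous variable into low/medium/high tends to pull the correlation towards zero; and Spearman correlation is just the product-moment correlation of the ranks.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Plain product-moment correlation; no adjustment for discreteness."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def midranks(values):
    """Ranks 1..n, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of midranks."""
    return pearson(midranks(xs), midranks(ys))


def simulate(n=5000, rho=0.6, cuts=(-0.5, 0.5), seed=2008):
    """Hypothetical data: x is correlated with a latent normal variable,
    which is observed only as levels 1, 2, 3 (low, medium, high)."""
    rng = random.Random(seed)
    latent = [rng.gauss(0, 1) for _ in range(n)]
    x = [rho * z + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for z in latent]
    level = [1 + sum(z > c for c in cuts) for z in latent]
    return x, latent, level


x, latent, level = simulate()
r_latent = pearson(x, latent)    # correlation with the unobserved variable
r_level = pearson(x, level)      # same formula applied to the 3-level version
r_spearman = spearman(x, level)  # rank-based alternative

print(f"x vs latent:           {r_latent:.3f}")
print(f"x vs level (Pearson):  {r_level:.3f}")
print(f"x vs level (Spearman): {r_spearman:.3f}")

# A non-monotonic relationship: y is determined exactly by u, yet the
# product-moment correlation is zero.
u = [-2, -1, 0, 1, 2]
print(pearson(u, [v * v for v in u]))  # 0.0
```

Under these assumptions the correlation with the three-level variable comes out smaller than the correlation with the latent variable, which is the kind of difference Q2 asks about. How much it matters in a real dataset depends on the number of levels, where the unknown thresholds fall, and how noisy the grouping was, none of which this sketch addresses.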

**References**: **st: Correlation coefficient between discrete and continuous variables** *From:* "Sergiy Radyakin" <serjradyakin@gmail.com>


© Copyright 1996–2015 StataCorp LP