The Stata listserver

RE: RE: st: RE: Econometrics Theory Questions on Dummies and Correlation Analysis


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: RE: st: RE: Econometrics Theory Questions on Dummies and Correlation Analysis
Date   Mon, 18 Apr 2005 22:39:48 +0100

Your first paragraph strikes me as somewhat confused. 
The variance of a binary variable is a perfectly 
well-defined and meaningful quantity. It would 
be difficult to think about e.g. confidence 
intervals of proportions otherwise. 
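
A minimal Python sketch of that point, using made-up 0/1 data and a
hypothetical helper name (binary_summary). For a binary variable with
mean p, the population variance is p(1 - p), and that same quantity
drives the usual large-sample (Wald) interval for a proportion. The
z = 1.96 default is an assumption corresponding to roughly 95% coverage.

```python
import math

def binary_summary(x, z=1.96):
    """Return (p, variance, (lo, hi)) for a list x of 0s and 1s.

    variance is the population variance computed directly from the data;
    (lo, hi) is a Wald interval for the proportion, built from p(1 - p).
    """
    n = len(x)
    p = sum(x) / n                               # mean of a 0/1 variable = proportion of 1s
    var = sum((v - p) ** 2 for v in x) / n       # population variance from the definition
    half = z * math.sqrt(p * (1 - p) / n)        # half-width uses the binary variance p(1 - p)
    return p, var, (p - half, p + half)

# Illustrative data only: six 1s and four 0s.
indicator = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
p, var, (lo, hi) = binary_summary(indicator)
print(p, var, lo, hi)
```

The variance is a function of the mean here, but it is still the
quantity that sets the width of the interval.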

According to one paper in the American Statistician, there are 
12 ways to think about the correlation; someone 
then wrote a note about a 13th; and no doubt there 
are more. So there are several things that 
could be emphasised. My taste is to stress
(in teaching and otherwise) that correlation 
measures linearity of relationship, but this depends 
on there being a relationship to summarize in 
the first place, which depends on positive 
variances in both variables. 
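
As a rough illustration in Python (made-up data; pearson is a
hypothetical helper written out from the textbook formula, not a
library call): two identical dummies give a correlation of 1, reversing
the coding of one of them only flips the sign, and a dummy with zero
variance leaves the correlation undefined, which the sketch reports as
None.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Returns None when either variable has zero variance, because the
    correlation is then undefined.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Illustrative dummies only.
d = [0, 0, 1, 1, 1]
flipped = [1 - v for v in d]      # same variable with the 0/1 coding reversed
constant = [1, 1, 1, 1, 1]        # no variation: everyone in one category

r_same = pearson(d, d)
r_flipped = pearson(d, flipped)
r_constant = pearson(d, constant)
print(r_same, r_flipped, r_constant)
```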

I agree that means and variances of
non-binary nominal codes don't usually make 
sense. The only exception I can think of 
is that zero variance has a clearcut 
interpretation, namely that all observations 
are in the same category. But you clearly 
don't need the variance to tell you that, 
and I don't think this is a point at issue 
in this thread. 

But as pointed out in an earlier post of mine, 
part of the problem here is that "nominal" is 
often used to lump together measurement scales
that are quite different, namely binary and 
polytomous. 

Nick 
n.j.cox@durham.ac.uk 

SamL
 
> I thought the technical reason was that for binary variables the
> variance is a function of the mean.  So, there's no more info in the
> variance than there is in the mean.  So one doesn't calculate (or,
> more important, present or analyze) variances or covariances of
> binary variables.
> 
> Variances and covariances are building blocks of the Pearson
> correlation coefficient (under one derivation), so . . . if
> calculating variances of binary variables is not informative,
> shouldn't one expect the correlation coefficient to be uninformative
> or, worse, biased?
> 
> Put another way, one would probably not report the mean (or
> variance) of a nominal variable (e.g., Christian=1, Muslim=2,
> Buddhist=3, Other=4, so the mean is 1.8?).  Thus, one doesn't use
> statistics based on means and variances with such variables.
> 
> That, I thought, was the technical reason.  If I am mistaken, I'd
> love to learn that, in the spirit of deepening knowledge and being
> sure to teach students what's what.
> 
> Respectfully,
> Sam
> 
> On Mon, 18 Apr 2005, Paul Millar wrote:
> 
> > I don't have any quarrel about whether correlational techniques
> > *can* be applied here.  Clearly there is no technical or
> > mathematical reason why they shouldn't be.
> >
> > The question, as I read it, was rather whether there are any
> > *non-technical* reasons for which there may be objections to using
> > these techniques for binary data.  Since I don't know the data, or
> > what is being measured, I cannot determine the level of measurement
> > of the variable solely from the fact that it is binary.  One can
> > assume that it is at least nominal, but it can, at least in theory,
> > be any of the four possibilities (albeit at low precision).
> > Certainly, binary data is not *always* ordinal.  For example, blue
> > eyes are not necessarily of higher rank than other eye colours, so
> > a binary indicator for blue eyes would be nominal.  One can assert
> > other levels of measurement, but usually with some sort of
> > justification.  I don't recall any requirement of logistic
> > regression with respect to level of measurement of the dependent
> > variable, so long as it is binary.  I stand by the general
> > principle that, other factors held constant, simpler techniques are
> > preferable.
> >
> > Anyway, I still claim that there are (non-technical) reasons that
> > one might choose not to use the techniques described, even if one
> > is being a bit nit-picky.  One can still use them, of course; their
> > use should just be accompanied by a comment on why.
> >
> > - Paul
> >
> > ----- Original Message -----
> > From: Nick Cox <n.j.cox@durham.ac.uk>
> > Date: Monday, April 18, 2005 12:20 pm
> > Subject: RE: st: RE: Econometrics Theory Questions on Dummies and Correlation Analysis
> >
> > > There is much good advice here, but it still
> > > is further than I would go, and bound up
> > > with a more literal reading of the assertions
> > > of Stanley Smith Stevens
> > >
> > > http://www.nap.edu/openbook/0309022452/html/424.html
> > >
> > > and others on nominal, ordinal, interval and ratio
> > > scales, and what you can do with them, than seems defensible.
> > >
> > > Also, arguments about what was designed to do what
> > > don't help much here. The techniques work the
> > > way they work because of the mathematics of what is
> > > being done, not according to what was in the
> > > inventor's mind at the time. Anyway, historically,
> > > this is a most dangerous tack, as it was (Karl) Pearson
> > > above all others who thought that correlations could
> > > be pulled out of categorical data in all sorts of ways:
> > > you just needed the right formula to do it.
> > >
> > > Regression (correlation if anyone insists, but the logic
> > > is the same)  can't discern the categorical origins
> > > of dummy variables. It just sees 0s and 1s.
> > >
> > > At one extreme, suppose you have two identical
> > > dummy variables (and some variation in each).
> > > In terms of a scatter plot, you have two clusters,
> > > one at the origin (0,0) and one at (1,1), like this
> > >
> > >
> > >                  *
> > >
> > >
> > >
> > >
> > >     *
> > >
> > >
> > > and a straight line is a perfect summary of such
> > > data, and so the Pearson correlation is identically 1.
> > > Also, this on the RHS of a model has implications
> > > for the model. In practice, as Paul emphasises, you
> > > would do well to count the numbers as well, but this
> > > result holds irrespective of coding and it is perfectly
> > > sensible statistically.
> > >
> > > More generally, for paired dummies you have clusters of zero or
> > > more data at (0,0), (0,1), (1,0) and (1,1)
> > > and the correlation you get will depend on the
> > > "votes cast" by each of those clusters. In many
> > > cases, the results won't be especially easy
> > > to interpret, but they are not crazy or stupid.
> > > Mind you, almost no correlation is easy to
> > > interpret without looking at the corresponding scatter plot,
> > > so nothing has changed there.
> > >
> > > I don't think the case of Spearman correlation
> > > needs much extra discussion. Note that binary scales
> > > are always ordinal. In correlating, the signs may
> > > be arbitrary, but the magnitudes of Spearman
> > > correlations won't be.
> > >
> > > In fact, in many cases they
> > > are counts too, in a perhaps strained sense (how
> > > many women inside this person? answer: either 0 or 1).
> > >
> > > Note that no one, to the best of my knowledge, argues
> > > that logit regression is inapplicable to binary
> > > responses because you can't (shouldn't) apply such techniques
> > > to "nominal" data!
> > >
> > > Nick
> > > n.j.cox@durham.ac.uk
> > >
> > > Paul Millar
> > >
> > > > on Dummies and Correlation Analysis...
> > > >
> > > > > 1. Is there any theory that prohibits one from undertaking a
> > > > >    correlation analysis (i.e., correlation matrix) with either
> > > > >    the Pearson or the Spearman rank correlation test on
> > > > >    variables that are all dummies?
> > > >
> > > > Although technically there doesn't seem to be anything
> > > > preventing the kind of analysis you propose, from a
> > > > theoretical (or at least methodological) point of view you
> > > > wouldn't normally use this method for at least two reasons.
> > > > 1) The level of measurement of the variables does not
> > > > coincide with the level of measurement of the techniques.
> > > > Pearson correlations are designed for interval (or ratio)
> > > > measures and Spearman for ordinal.  You have nominal measures
> > > > (or so it seems).
> > > > 2) It is more complex than required, and potentially
> > > > obscures, rather than helps, understanding of the
> > > > relationships between the variables.  A series of simple
> > > > crosstabs might be more illuminating.
> > > > From a methodological point of view, a compelling reason to
> > > > overcome these objections would be advisable to make your
> > > > choice of method more defensible.
> > > >
> > > > > 2. If there is no prohibition, theory wise, can the bivariate
> > > > >    correlation coefficients for the dummy variables be
> > > > >    interpreted in the same way as one would do with continuous
> > > > >    variables?
> > > > As stated above, the interpretation would require that you
> > > > treat your nominal measures as if they are interval or
> > > > ordinal.  You need to justify this treatment before
> > > > interpretation, at least if you are picky picky picky.
> > > >
> > > > - Paul Millar
> > > > Sociology
> > > > University of Calgary
> > > >
