Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: collinearity in categorical variables

From: David Hoaglin

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: collinearity in categorical variables

Date: Fri, 26 Apr 2013 11:24:38 -0400

```
Mitchell,

To get information on "correlation" between two categorical variables,
a crosstab would be a good start.  The idea is to look at the data in
detail before (or instead of) reducing the relation of the two
variables to a single number.

The "variance inflation factor" (VIF) is defined for an individual
predictor variable.  Conceptually, one uses that predictor as the
dependent variable in a regression on all the other predictors, and
interprets 1 - R^2 from the regression as the "usable fraction" of
that predictor in the full regression model.  The VIF for that
predictor is the reciprocal of that 1 - R^2.

A VIF gives information on how badly the standard error for the estimate
of that regression coefficient is degraded, but it does not show which
predictors are involved in the collinearity (if a troublesome
collinearity is present).  That's where -coldiag2- and related
methods come in.

I don't know whether someone has generalized VIF to categorical
variables, but you would not need that if you applied -coldiag2- to
the full set of predictors.  That diagnostic is designed for OLS
regression, but it is often useful for logistic regression.

David Hoaglin

On Fri, Apr 26, 2013 at 8:58 AM, Mitchell F. Berman <mfb1@columbia.edu> wrote:
> Thank you for the reply.  Yes, I see that for a single categorical variable
> broken into dummy variables, collinearity between the dummy variables would
> be zero.
> But my question concerns correlation between related, similar, categorical
> variables.
>
> If I have multiple similar categorical variables, for example: homebound,
> uses a walker, home-health aide, lives in nursing home, these categorical
> variables will move together through the data--- won't be identical for all
> patients, but correlated.
>
> People mention standard VIF (which I know how to do), but the more thorough
> answers imply this is not correct.
>
> This link suggests perturb (a module available for Stata, R, and SPSS) or
> polychoric correlation
> http://stats.stackexchange.com/questions/35233/how-to-test-for-and-remedy-multicollinearity-in-optimal-scaling-ordinal-regressi
>
> This link from talkstats suggests that polychoric correlations (available in
> R) are preferable, because correlations calculated using the Pearson
> product-moment formula are invalid for categorical data.
>
> Someone else suggested the Spearman correlation coefficient
>
> factor analysis
>
> This is beyond my level of theoretical understanding.  I was trying to get a
> sense of what the experts on the Stata List server use.
>
> Thank you for any additional input.
>
> Mitchell
```
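
The VIF definition given in words in the reply above can be written out as a short formula. This is only a sketch of the standard textbook form: the symbols $R_j^2$, $\mathrm{VIF}_j$, and $\mathrm{SE}_0$ are notation introduced here and do not appear in the original message, and the numbers in the last line are an illustrative assumption, not results from any data set.

```latex
% R_j^2: the R^2 from regressing predictor x_j on all the other predictors.
% 1 - R_j^2 is the "usable fraction" of x_j in the full regression model.
\[
  \mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\]
% SE_0 is the standard error the coefficient would have if x_j were
% uncorrelated with the other predictors (R_j^2 = 0), all else equal.
\[
  \mathrm{SE}(\hat\beta_j) = \sqrt{\mathrm{VIF}_j}\;\mathrm{SE}_0(\hat\beta_j)
\]
% Illustrative arithmetic (assumed value R_j^2 = 0.9):
\[
  R_j^2 = 0.9 \;\Rightarrow\; \mathrm{VIF}_j = \frac{1}{0.1} = 10,
  \qquad \sqrt{10} \approx 3.16
\]
```

With the assumed value, the standard error would be roughly three times as large as it would be without collinearity. As the reply notes, the VIF measures only this degradation; it does not identify which predictors are involved.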