Re: st: collinearity in categorical variables

From   Steve Samuels
Subject   Re: st: collinearity in categorical variables
Date   Fri, 26 Apr 2013 18:56:46 -0400


Solutions involving polychoric correlations don't appeal at all. They
are dependent on the assumed multivariate Gaussian model for the
unobserved latent variables. That's a strong assumption, and you must,
in addition, estimate the correlation of the latent variables with the
observed continuous predictors. Even then, there is no guarantee that
the mashed-together correlation matrix will be positive definite.

So I suggest that you just go ahead and use -perturb-.

Shrinkage techniques are likely to both diagnose and ameliorate problems
of highly correlated variables. See Le Cessie and Van Houwelingen (1992)
and Harrell (2001). For penalized logistic regression or the lasso
(Tibshirani, 1996, 2011), try -plogit- by Gareth Ambler. See: You can
install it from within Stata by typing:

. net from 
. net install plogit



Le Cessie, S., and J. C. Van Houwelingen. 1992. Ridge estimators in
logistic regression. Applied Statistics 41, no. 1: 191-201.

Harrell, Frank E. 2001. Regression modeling strategies: with
applications to linear models, logistic regression, and survival
analysis. New York: Springer.

Tibshirani, Robert. 1996. Regression shrinkage and selection via the
lasso. Journal of the Royal Statistical Society. Series B
(Methodological) 58, no. 1: 267-288.

Tibshirani, Robert. 2011. Regression shrinkage and selection via the
lasso: a retrospective. Journal of the Royal Statistical Society: Series
B (Statistical Methodology) 73, no. 3: 273-282.

On Apr 26, 2013, at 11:24 AM, David Hoaglin wrote:


To get information on "correlation" between two categorical variables,
a crosstab would be a good start.  The idea is to look at the data in
detail before (or instead of) reducing the relation of the two
variables to a single number.
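
As a rough illustration of the crosstab idea outside Stata, here is a
minimal Python sketch using only the standard library. The two 0/1
indicators (homebound, walker) and their values are invented for the
example, and the helper name crosstab is hypothetical.

```python
from collections import Counter

def crosstab(a, b):
    """Return sorted row levels, sorted column levels, and the table of joint counts."""
    counts = Counter(zip(a, b))          # count each (a, b) pair
    rows = sorted(set(a))
    cols = sorted(set(b))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical 0/1 indicators for ten patients (invented for illustration).
homebound = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
walker    = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

rows, cols, table = crosstab(homebound, walker)
print("homebound \\ walker", *cols)
for r, counts_row in zip(rows, table):
    print(f"{r:>18}", *counts_row)
```

Reading the four cell counts directly shows how often the two
indicators agree and disagree, which a single correlation coefficient
would hide.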

The "variance inflation factor" (VIF) is defined for an individual
predictor variable.  Conceptually, one uses that predictor as the
dependent variable in a regression on all the other predictors, and
interprets 1 - R^2 from the regression as the "usable fraction" of
that predictor in the full regression model.  The VIF for that
predictor is the reciprocal of that 1 - R^2.
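
The definition above can be sketched directly: regress one predictor on
the others, take R^2, and report 1/(1 - R^2). The Python below is only
an illustration under stated assumptions. It uses the standard library,
a small hand-written least-squares solver, and invented 0/1 indicators;
the function names (solve, r_squared, vif) are hypothetical and are not
part of any Stata command.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(y, others):
    """R^2 from least squares of y on an intercept plus the columns in `others`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in others] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
           for p in range(k)]
    xty = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)                             # normal equations
    fitted = [sum(X[i][p] * beta[p] for p in range(k)) for i in range(n)]
    ybar = sum(y) / n
    ss_res = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    ss_tot = sum((v - ybar) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def vif(predictors, j):
    """VIF of predictor j: reciprocal of 1 - R^2 from regressing it on the rest."""
    others = [col for i, col in enumerate(predictors) if i != j]
    return 1.0 / (1.0 - r_squared(predictors[j], others))

# Hypothetical 0/1 indicators for ten patients (invented for illustration).
homebound = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
walker    = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
aide      = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

predictors = [homebound, walker, aide]
for name, j in (("homebound", 0), ("walker", 1), ("aide", 2)):
    print(name, round(vif(predictors, j), 3))
```

A predictor that is uncorrelated with the others has R^2 = 0 and a VIF
of 1; the VIF grows without bound as R^2 approaches 1.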

A VIF indicates how badly the standard error for the estimate
of that regression coefficient is degraded, but it does not show which
predictors are involved in the collinearity (if a troublesome
collinearity is present).  That's where -coldiag2-  and related
methods come in.

I don't know whether someone has generalized VIF to categorical
variables, but you would not need that if you applied -coldiag2- to
the full set of predictors.  That diagnosis is designed for OLS
regression, but it is often useful for logistic regression.

I'll follow the links when I get a chance.

David Hoaglin

On Fri, Apr 26, 2013 at 8:58 AM, Mitchell F. Berman wrote:
> Thank you for the reply.  Yes, I see that for a single categorical variable
> broken into dummy variables, collinearity between the dummy variables would
> be zero.
> But my question concerns correlation between related, similar, categorical
> variables.
> If I have multiple similar categorical variables, for example: homebound,
> uses a walker, home-health aide, lives in nursing home, these categorical
> variables will move together through the data: not identical for all
> patients, but correlated.
> People mention standard VIF (which I know how to do), but the more thorough
> answers imply this is not correct.
> This link suggests perturb (a module available for Stata, R, and SPSS) or
> polychoric correlation
> This link from talkstats suggests that polychoric correlations (available in
> R) are preferable, because correlations calculated using the Pearson
> product-moment formula are invalid for categorical data.
> Someone else suggested the Spearman correlation coefficient
> factor analysis
> This is beyond my level of theoretical understanding.  I was trying to get a
> sense of what the experts on the Stata List server use.
> Thank you for any additional input.
> Mitchell