


Re: st: collinearity in categorical variables

From   Steve Samuels
Subject   Re: st: collinearity in categorical variables
Date   Fri, 26 Apr 2013 11:14:20 -0400

The following technical report discusses multicollinearity and
categorical variables.

Malte Wissmann, Helge Toutenburg, and Shalabh. Role of Categorical
Variables in Multicollinearity in the Linear Regression Model. Technical
Report Number 008, 2007, Department of Statistics, University of Munich.

Wissmann et al. give two references to the perturbation approach to
detecting collinearity.

DA Belsley, Conditioning Diagnostics: Collinearity and Weak Data in
Regression, 1st ed., John Wiley & Sons, Inc., New York, 1991.

CR Rao, H. Toutenburg, Shalabh, and C. Heumann, Linear Models and
Generalizations - Least Squares and Alternatives, 3rd ed., Springer,

John Hendrickx, the author of -perturb- (and -coldiag2-), also wrote the
'perturb' package in R. The documentation contains a reference to a
paper, but the link is broken. (The Wissmann et al. report refers to the
paper differently and also gives a broken link.)

Hendrickx, John, and Ben Pelzer (2004). Collinearity involving ordered and
unordered categorical variables. Paper presented at the RC33 conference
in Amsterdam, August 17-20, 2004.


On Apr 26, 2013, at 9:33 AM, Maarten Buis wrote:

On Fri, Apr 26, 2013 at 2:58 PM, Mitchell F. Berman wrote:
> I see that for a single categorical variable
> broken into dummy variables, collinearity between the dummy variables would
> be zero.

That is incorrect: the correlation between these indicator variables
tends to be negative and can easily be non-trivial.
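
As a minimal sketch of this point (not part of the original post), the
following Python snippet builds two indicator variables from a made-up
three-level categorical variable and computes their Pearson correlation.
The data and the names `pearson`, `group`, `d_a`, and `d_b` are
hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical categorical variable: 40% "a", 40% "b", 20% "c".
group = ["a"] * 40 + ["b"] * 40 + ["c"] * 20

# Indicator (dummy) variables for "a" and "b"; "c" is the reference level.
d_a = [1 if g == "a" else 0 for g in group]
d_b = [1 if g == "b" else 0 for g in group]

# An observation can belong to only one level, so the indicators are
# mutually exclusive and their correlation is negative.
r = pearson(d_a, d_b)
print(round(r, 3))  # -0.667
```

With these proportions the correlation is -2/3, which is far from zero.
The smaller the reference category, the stronger the negative
correlation between the remaining indicators.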

> People mention standard VIF (which I know how to do), but the more thorough
> answers imply this is not correct.

Multicollinearity is all about correlation, so I see no problem with
using VIF. The VIF is based on correlation, and though you want to be
careful using correlation between binary variables (or a categorical
variable split up into different binary variables) when doing
substantive research, it is perfectly OK to use that to diagnose
multicollinearity, because that linear association is the real problem
when it comes to multicollinearity.
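
To illustrate how the VIF follows from the correlations among the
predictors, here is a small Python sketch (not part of the original
post, and not how Stata computes it internally). It relies on the
standard result that the VIFs are the diagonal elements of the inverse
of the predictors' correlation matrix. The data and the helper names
`pearson`, `invert`, and `vifs` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    # Augment the matrix with the identity matrix on the right.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for k in range(n):
            if k != col:
                f = aug[k][col]
                aug[k] = [a - f * b for a, b in zip(aug[k], aug[col])]
    return [row[n:] for row in aug]

def vifs(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    return [inv[j][j] for j in range(len(columns))]

# Hypothetical three-level categorical variable: 40% "a", 40% "b", 20% "c".
group = ["a"] * 40 + ["b"] * 40 + ["c"] * 20
d_a = [1 if g == "a" else 0 for g in group]
d_b = [1 if g == "b" else 0 for g in group]

vif_values = vifs([d_a, d_b])
print([round(v, 3) for v in vif_values])  # [1.8, 1.8]
```

With only two predictors this reduces to 1 / (1 - r^2), where r is their
correlation; here r = -2/3, so each VIF is 1.8.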

> I was trying to get a
> sense of what the experts on the Stata List server use.

I tend to do nothing about it. Multicollinearity is not a problem; it
is just an accurate reflection of the fact that you have less
information in your data than you would have liked. That may be
unfortunate, but it certainly is not a problem that needs to be
addressed. There are
always exceptions, but in those cases looking at patterns of linear
association between the explanatory variables is all that is needed,
so VIF would be perfectly fine.

-- Maarten

Maarten L. Buis
Reichpietschufer 50
10785 Berlin