From: David Hoaglin <dchoaglin@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: collinearity in categorical variables
Date: Fri, 26 Apr 2013 11:24:38 -0400
Mitchell,

To get information on "correlation" between two categorical variables, a crosstab would be a good start. The idea is to look at the data in detail before (or instead of) reducing the relation between the two variables to a single number.

The variance inflation factor (VIF) is defined for an individual predictor variable. Conceptually, one uses that predictor as the dependent variable in a regression on all the other predictors and interprets 1 - R^2 from that regression as the "usable fraction" of the predictor in the full regression model. The VIF for that predictor is the reciprocal of that 1 - R^2. A VIF gives information on how badly the standard error of that predictor's coefficient estimate is degraded, but it does not show which predictors are involved in the collinearity (if a troublesome collinearity is present). That's where -coldiag2- and related methods come in. (Minimal sketches of these commands appear after the quoted message below.)

I don't know whether anyone has generalized the VIF to categorical variables, but you would not need that if you applied -coldiag2- to the full set of predictors. That diagnostic is designed for OLS regression, but it is often useful for logistic regression as well.

I'll follow the links when I get a chance.

David Hoaglin

On Fri, Apr 26, 2013 at 8:58 AM, Mitchell F. Berman <mfb1@columbia.edu> wrote:
> Thank you for the reply. Yes, I see that for a single categorical variable
> broken into dummy variables, collinearity between the dummy variables would
> be zero.
>
> But my question concerns correlation between related, similar categorical
> variables.
>
> If I have multiple similar categorical variables, for example: homebound,
> uses a walker, has a home-health aide, lives in a nursing home, these
> variables will move together through the data: they won't be identical for
> all patients, but they will be correlated.
>
> People mention the standard VIF (which I know how to compute), but the more
> thorough answers imply that this is not correct.
>
> This link suggests -perturb- (a module available for Stata, R, and SPSS) or
> polychoric correlation:
> http://stats.stackexchange.com/questions/35233/how-to-test-for-and-remedy-multicollinearity-in-optimal-scaling-ordinal-regressi
>
> This link from talkstats suggests that polychoric correlations (available
> in R) are preferable, because Pearson product-moment correlations are
> invalid for categorical data:
> http://www.talkstats.com/showthread.php/22996-Collinearity-Among-Categorical-Variables-in-Regression
>
> Someone else suggested the Spearman correlation coefficient:
> http://www.statisticsforums.com/showthread.php?t=802
>
> Another thread suggested factor analysis:
> http://www.talkstats.com/showthread.php/13264-Collinearity-in-Logistic-Regression
>
> This is beyond my level of theoretical understanding. I was trying to get a
> sense of what the experts on the Statalist server use.
>
> Thank you for any additional input.
>
> Mitchell
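To make the crosstab suggestion above concrete, here is a minimal sketch using two of the indicator variables Mitchell described (the names homebound and walker are hypothetical). The chi2 and V options add Pearson's chi-squared test and Cramer's V below the table:

. tabulate homebound walker, chi2 V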
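The VIF recipe above can also be written out by hand. A minimal sketch, assuming hypothetical predictors x1-x4 and treating x1 as the predictor of interest:

. regress x1 x2 x3 x4
. display "VIF for x1 = " 1/(1 - e(r2))

Here e(r2) is the R^2 saved by -regress-, so the displayed value is 1/(1 - R^2). After fitting the full model (e.g., -regress y x1 x2 x3 x4-), official Stata's -estat vif- reports the same quantity for every predictor at once.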
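For the user-written tools mentioned in the thread, something along these lines should work, though the syntax is quoted from memory, so check the help files after installing; -spearman- is official Stata:

. ssc install coldiag2
. coldiag2 x1 x2 x3 x4                 // condition indexes and variance-decomposition proportions
. findit polychoric                    // Stas Kolenikov's package; install from the link findit returns
. polychoric x1 x2 x3 x4               // matrix of polychoric correlations
. spearman x1 x2 x3 x4, stats(rho p)   // rank correlations with p-values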