From: Steve Samuels <sjsamuels@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: collinearity in categorical variables
Date: Fri, 26 Apr 2013 18:56:46 -0400
Mitchell: Solutions involving polychoric correlations don't appeal to me at all. They depend on an assumed multivariate Gaussian model for the unobserved latent variables. That's a strong assumption, and you must, in addition, estimate the correlations of the latent variables with the observed continuous predictors. Even then, there is no guarantee that the mashed-together correlation matrix will be positive definite. So I suggest that you just go ahead and use -perturb-.

Shrinkage techniques are likely to both diagnose and ameliorate problems of highly correlated predictors; see Le Cessie and Van Houwelingen (1992) and Harrell (2001). For penalized logistic regression or the lasso (Tibshirani, 1996, 2011), try -plogit- by Gareth Ambler. See:

http://www.homepages.ucl.ac.uk/~ucakgam/stata.html

You can install it from within Stata by typing:

. net from http://www.homepages.ucl.ac.uk/~ucakgam/stata
. net install plogit

Steve

References:

Le Cessie, S., and J. C. Van Houwelingen. 1992. Ridge estimators in logistic regression. Applied Statistics 41(1): 191-201.

Harrell, F. E. 2001. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer.

Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58(1): 267-288.

Tibshirani, R. 2011. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 73(3): 273-282.

On Apr 26, 2013, at 11:24 AM, David Hoaglin wrote:

Mitchell,

To get information on "correlation" between two categorical variables, a crosstab would be a good start. The idea is to look at the data in detail before (or instead of) reducing the relation between the two variables to a single number.

The variance inflation factor (VIF) is defined for an individual predictor variable. Conceptually, one uses that predictor as the dependent variable in a regression on all the other predictors and interprets 1 - R^2 from that regression as the "usable fraction" of the predictor in the full regression model. The VIF for that predictor is the reciprocal of that 1 - R^2. A VIF tells you how badly the standard error of that predictor's estimated coefficient is degraded, but it does not show which predictors are involved in the collinearity (if a troublesome collinearity is present). That's where -coldiag2- and related methods come in. I don't know whether someone has generalized the VIF to categorical variables, but you would not need that if you applied -coldiag2- to the full set of predictors. That diagnostic is designed for OLS regression, but it is often useful for logistic regression as well.

I'll follow the links when I get a chance.

David Hoaglin
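As a concrete illustration of David's recipe, a minimal Stata sketch follows. It is untested: the variable names (homebound, walker, aide, nursinghome, and the outcome y) are hypothetical stand-ins built from Mitchell's example, and -coldiag2- is assumed to be installed from SSC.

. * crosstab two of the categorical predictors, with a chi-squared test
. tabulate homebound nursinghome, row chi2

. * VIF "by hand": regress one predictor on the others;
. * 1 - R^2 is its usable fraction, and its VIF is the reciprocal
. regress homebound walker aide nursinghome
. display "VIF for homebound = " 1/(1 - e(r2))

. * the same VIFs for every predictor at once, via an auxiliary OLS fit
. * (the VIFs depend only on the predictors, not on the outcome)
. regress y homebound walker aide nursinghome
. estat vif

. * condition indexes and variance-decomposition proportions,
. * which show which predictors are involved in a collinearity
. ssc install coldiag2
. coldiag2 homebound walker aide nursinghome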
On Fri, Apr 26, 2013 at 8:58 AM, Mitchell F. Berman <mfb1@columbia.edu> wrote:

> Thank you for the reply. Yes, I see that for a single categorical variable broken into dummy variables, collinearity between the dummy variables would be zero.
>
> But my question concerns correlation between related, similar categorical variables.
>
> If I have multiple similar categorical variables, for example: homebound, uses a walker, has a home-health aide, lives in a nursing home, these variables will move together through the data. They won't be identical for all patients, but they will be correlated.
>
> People mention the standard VIF (which I know how to compute), but the more thorough answers imply this is not correct.
>
> This link suggests -perturb- (a module available for Stata, R, and SPSS) or polychoric correlation:
> http://stats.stackexchange.com/questions/35233/how-to-test-for-and-remedy-multicollinearity-in-optimal-scaling-ordinal-regressi
>
> This link from talkstats suggests that polychoric correlations (available in R) are preferable, because Pearson product-moment correlations are invalid for categorical data:
> http://www.talkstats.com/showthread.php/22996-Collinearity-Among-Categorical-Variables-in-Regression
>
> Someone else suggested the Spearman correlation coefficient:
> http://www.statisticsforums.com/showthread.php?t=802
>
> Another thread suggests factor analysis:
> http://www.talkstats.com/showthread.php/13264-Collinearity-in-Logistic-Regression
>
> This is beyond my level of theoretical understanding. I was trying to get a sense of what the experts on Statalist use.
>
> Thank you for any additional input.
>
> Mitchell
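For completeness, the alternatives collected in Mitchell's links can be tried in Stata roughly as follows. Again a hedged sketch, not a recommendation: the variable names are the same hypothetical indicators as above, and -polychoric- is Stas Kolenikov's user-written command (locate it with -findit-).

. * Spearman rank correlations among the binary indicators
. spearman homebound walker aide nursinghome, stats(rho p)

. * polychoric correlation matrix, assuming latent bivariate normality
. findit polychoric
. polychoric homebound walker aide nursinghome

Note that the polychoric approach rests on exactly the latent-Gaussian assumption Steve cautions against above.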