The Stata listserver

st: collinear categorical variable identification


From   Buzz Burhans <wsb2@cornell.edu>
To   statalist@hsphsun2.harvard.edu
Subject   st: collinear categorical variable identification
Date   Thu, 12 Jun 2003 14:06:42 -0400

Dear Stata listers, especially epidemiologists,

I have a question related to the identification and removal of collinear categorical variables. Specifically, my question is about the use of coldiag or other methods to identify collinear dichotomous variables for logistic regression.

I have replaced the nominal and ordinal independent variables with dichotomous indicator variables. The dataset contains a fairly large number of factors which are collinear independent variables, and I am uncertain of the best way to identify and eliminate collinearity in the case of categorical variables.

I have used coldiag, with a cutoff of 30, accompanied by theoretical plausibility, to identify candidates for removal from the independent variables due to collinearity. However, I find that there is still some instability in the model, indicated by wide CIs for the odds ratios. I then looked at the correlation matrix for the regressors, and using a combination of identification by a lower singular value (10) and what is suggested by the correlations among the independent variables, I identify candidates for further removal. In making the identification I consider the correlation of two independent variables (if > 0.35, strong consideration for removal), the contribution to the variance decomposition (> 0.5 suggests removal), the strength of the correlation with the dependent variable (when there are competing candidates, the more strongly correlated of the two is retained), and theoretical plausibility.
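To make the pairwise screening step concrete, here is a minimal Python sketch, using only the standard library. It is not coldiag and does not compute condition indexes or variance decomposition proportions; it only illustrates the correlation rule described above. For two 0/1 indicators the Pearson correlation is the phi coefficient. The variable names (x1, x2, x3) and the data are made up for illustration, and the 0.35 cutoff is the one mentioned above.

```python
from itertools import combinations
from math import sqrt


def phi(x, y):
    """Pearson correlation of two 0/1 variables (the phi coefficient)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant indicator has no defined correlation
    return cov / sqrt(vx * vy)


def flag_pairs(indicators, cutoff=0.35):
    """Return (name1, name2, phi) for every pair with |phi| above the cutoff."""
    flagged = []
    for (n1, v1), (n2, v2) in combinations(indicators.items(), 2):
        r = phi(v1, v2)
        if abs(r) > cutoff:
            flagged.append((n1, n2, round(r, 3)))
    return flagged


# Hypothetical indicator variables, purely for illustration.
data = {
    "x1": [1, 1, 1, 0, 0, 0, 1, 0],
    "x2": [1, 1, 1, 0, 0, 0, 0, 1],
    "x3": [1, 0, 1, 0, 1, 0, 1, 0],
}

print(flag_pairs(data))
```

A flagged pair is only a candidate for closer inspection; which of the two variables to drop would still depend on the other criteria listed above.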

My understanding is that using the correlation matrix when the regressor matrix includes dichotomous variables is not appropriate. However, the models are stable and sensible, and improved (more stable) compared with my earlier runs, when I used only coldiag and a higher condition number as a cutoff. When I go back and tabulate the competing variables in two-way tables, my decisions seem reasonable. When the two variables are both dichotomous, the odds ratios seem to support the decisions, and visual inspection of the tables for categorical variables with several categories seems consistent with the decisions for retention or exclusion I made. (The dataset is relatively small, and empty cells are not uncommon in the two-way tabulations.)
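The two-way table check for a pair of dichotomous variables can be sketched as follows, again in plain Python. The data are made up. Adding 0.5 to every cell when a cell is empty (the Haldane-Anscombe correction) is an assumption of this sketch, not something taken from the analysis described here; it is one common way to get a finite odds ratio from a sparse table.

```python
def two_by_two(x, y):
    """Cell counts (a, b, c, d) of the 2x2 table of two 0/1 variables.

    a: x=1,y=1   b: x=1,y=0   c: x=0,y=1   d: x=0,y=0
    """
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, b, c, d


def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c); adds 0.5 to every cell if any cell is empty."""
    if 0 in (a, b, c, d):
        # Assumed correction for empty cells (Haldane-Anscombe).
        a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    return (a * d) / (b * c)


# Hypothetical pair of competing indicator variables.
x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 1, 0, 0, 0, 0, 1]

table = two_by_two(x, y)
print(table, odds_ratio(*table))  # (3, 1, 1, 3) 9.0

# A table with an empty cell still yields a finite value.
print(odds_ratio(4, 0, 1, 3))  # 21.0
```

An odds ratio far from 1 between two regressors indicates a strong association in the table, which is the same information the phi coefficient summarizes on a different scale.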


Can you comment on my strategy, in particular on the appropriateness of coldiag approach in this case, and on the appropriateness of using a correlation assessment for categorical variables? Can you suggest a better strategy?

Thanks very much for any help you can offer.



Buzz Burhans
wsb2@cornell.edu






