
From: "David E Moore" <davem@hartman-group.com>
To: statalist@hsphsun2.harvard.edu
Subject: st: RE: collinear categorical variable identification
Date: Thu, 12 Jun 2003 12:46:56 -0700

One specific comment and then some general comments. Note, these are just my opinions on matters over which reasonable researchers might differ, so they're offered in the spirit of opening rather than shutting down discussion. Having said that, let me note further that my comments are critical of the approach described by Buzz Burhans.

Specifically, in response to the question of the appropriateness of using a correlation matrix with dummy variables, don't. Personally, I think the common practice of examining correlation matrices to solve collinearity problems, regardless of how variables are measured, is ill advised. If the collinearity were simple enough to be revealed by a correlation matrix, then it wouldn't take a correlation matrix to find it. More to the point, problems of collinearity are often so complex that a correlation matrix will obscure as much as it reveals.

Now, more generally, the problem of collinearity is one of estimation, so it would be nice if a few tools and rules of thumb could get us around it. Unfortunately, this is not the case. Just as we shouldn't be tempted to use stepwise methods to formulate regression models, I don't think we should rely on automated processes for diagnosing and solving problems of collinearity. Buzz Burhans has indicated that "theoretical plausibility" is one of the criteria he used. Aside from the estimated coefficients and standard errors (or CIs), which alert us to the existence of the problem, I submit this is the only criterion that should be used. (I assume that whatever procedure is followed, when dummy variables are involved they are excluded in whole sets corresponding to the original variables and not discarded willy-nilly.)

Problems with collinearity should be readily apparent from the behavior of the estimated standard errors and/or coefficients. Short of collecting more data, which is often the best solution, solving the problem is much more difficult than identifying it.
I say it's difficult because I assume every variable included has a theoretical reason for being included and the researcher is rarely justified in discarding relevant variables. However, when collinearity is severe enough that we can't estimate a model, then we have to make some compromises. This is where I believe it is the researcher's responsibility to reconsider or rethink the theory that led to the model. Are the variables included truly distinct factors or is there redundancy when variables are combined?

As an aside, which means I'm not necessarily talking about Buzz Burhans' situation, it's been my experience that far too many "problems" are blamed on collinearity. A parameter estimate with a large variance is not by itself a symptom of collinearity, for example. More often than not, it indicates an irrelevant variable has been included in the analysis -- a theoretical problem rather than a collinearity problem. In general, misspecification errors are far more common than collinearity problems and should be ruled out before suspecting collinearity.

Dave Moore

> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu]On Behalf Of Buzz Burhans
> Sent: Thursday, June 12, 2003 11:07 AM
> To: statalist@hsphsun2.harvard.edu
> Subject: st: collinear categorical variable identification
>
> Dear Stata listers, especially epidemiologists,
>
> I have a question related to the identification and removal of collinear
> categorical variables. My question(s) are about use of coldiag or other
> methods to identify collinear dichotomous variables for logistic regression.
>
> I have replaced the nominal and ordinal independent variables with
> dichotomous indicator variables. The dataset contains a fairly large
> number of factors which are collinear independent variables, and I am
> uncertain of the best way to identify and eliminate collinearity in the
> case of categorical variables.
>
> I have used coldiag, with a cut off of 30, accompanied by theoretical
> plausibility, to identify candidates for removal from independent variables
> due to collinearity. However, I find that there is still some instability
> in the model, indicated by large CIs for the odds ratios. I then looked at
> the correlation matrix for the regressors, and using a combination of
> identification by a lower singular value (10), and by what is suggested by
> the independent variable correlations, I identify candidates for further
> removal. In making the identification I consider the correlation of two
> independent variables (if > 0.35, strong consideration for removal) and the
> contribution to variance decomposition (> 0.5 suggests removal), and the
> strength of the correlation to the dependent variable (the stronger of two
> variables suggests it should be retained when there are competing
> candidates), and theoretical plausibility.
>
> My understanding is that using the correlation matrix when the regressor
> matrix includes dichotomous variables is not appropriate. However, the
> models are stable and sensible, and improved (more stable) from my earlier
> runs when I used simply the coldiag and a higher condition number as a
> cutoff. When I go back and tabulate the competing variables in 2 way
> tables my decisions seem to be reasonable. When they are two dichotomous
> variables the odds ratios seem to support the decisions, and visual
> inspection of the tables for categorical variables with several categories
> seems consistent with the decisions
> for retention or exclusion I made. (The dataset is relatively small, and
> there are not uncommonly empty cells in the twoway tabulations.)
>
> Can you comment on my strategy, in particular on the appropriateness of
> coldiag approach in this case, and on the appropriateness of using a
> correlation assessment for categorical variables? Can you suggest a better
> strategy?
>
> Thanks very much for any help you can offer.
>
> Buzz Burhans
> wsb2@cornell.edu
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
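
The claim in the reply that a correlation matrix can obscure collinearity among dummy variables can be sketched with a small, made-up example. The Python below uses only the standard library; the data and the names `levels`, `dummies`, `pearson`, and `determinant` are hypothetical illustrations, not anything taken from either poster's analysis or from coldiag. It expands one four-level nominal variable into a full set of four indicators, which are exactly linearly dependent because they sum to 1 in every row, and then compares the pairwise correlations with the determinant of the whole correlation matrix. Exact dependence is chosen only to keep the arithmetic simple.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


# Hypothetical data: one nominal variable with four equally common levels,
# expanded into a FULL set of four indicator (dummy) variables.
levels = [0, 1, 2, 3] * 5  # 20 observations
dummies = [[1.0 if v == k else 0.0 for v in levels] for k in range(4)]

# Every row satisfies d0 + d1 + d2 + d3 == 1: an exact linear dependence.
rows_sum_to_one = all(sum(col[i] for col in dummies) == 1.0
                      for i in range(len(levels)))

# Pairwise view: the largest absolute correlation between any two dummies.
largest_abs_r = max(abs(pearson(dummies[i], dummies[j]))
                    for i, j in combinations(range(4), 2))

# Joint view: a singular correlation matrix signals the dependence.
corr = [[1.0 if i == j else pearson(dummies[i], dummies[j])
         for j in range(4)] for i in range(4)]
det_corr = determinant(corr)

print(f"all rows sum to 1: {rows_sum_to_one}")
print(f"largest |r| between any two dummies: {largest_abs_r:.3f}")
print(f"determinant of the correlation matrix: {det_corr:.2e}")
```

With these balanced data every pairwise correlation is -1/3, so none reaches the 0.35 screening value mentioned in the original question, yet the correlation matrix is singular (its determinant is zero up to rounding). A pairwise screen therefore passes a set of variables that are perfectly collinear as a group. Real data will usually show near rather than exact dependence, so this is a sketch of the mechanism, not a diagnostic procedure.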

**Follow-Ups**: **st: Re: collinear categorical variable identification** *From:* Buzz Burhans <wsb2@cornell.edu>

**References**: **st: collinear categorical variable identification** *From:* Buzz Burhans <wsb2@cornell.edu>
