.-
help for ^colldiag^                                          [sg32: STB-24]
.-

Collinearity Diagnostics
------------------------

    ^colldiag^ [^nocons^]


Description
-----------

colldiag calculates and displays the matrix of variance decomposition
proportions for the independent variables in a linear regression model.

This command must follow a call to @fit@.


Remarks
-------

In the case of orthogonal predictors, the variance decomposition proportion
matrix would be an identity matrix.  One should examine the dependencies of
the variances on the principal components by focusing on the decomposition
of the variables associated with high condition numbers.

The condition number is a measure of the dependence of the independent
variables.  Typical threshold values for the condition number are n* = 10,
15, or even 30.  As you look at the rows associated with the high condition
numbers, you should note the variance decomposition proportions that are
higher than some threshold value (like p*=.50).

You should note the following:

  1) An independent variable will have a degraded coefficient because of a
     near dependency if it is one of two or more variates with
     variance-decomposition proportions in excess of some threshold value
     p*, such as .50.  The number of near dependencies is the number of
     condition numbers greater than the threshold value n*.

  2) Those variates whose aggregate variance-decomposition proportions
     exceed the threshold value p* are involved in at least one of the
     dependencies.  The aggregate is formed over the competing condition
     numbers (condition numbers of the same order of magnitude that exceed
     the threshold value n*).

  3) A dominating dependency occurs when the condition number is an order
     of magnitude larger than the other condition numbers.  This can
     obscure information about a variate's simultaneous involvement in a
     weaker dependency.  In this case, additional analysis is warranted to
     investigate the relationships of all potentially involved variates.

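The three rules above can be sketched outside of Stata.  The Python fragment
below is an illustration only and is not part of colldiag.  The table values
are copied from the colldiag output in the Example section of this help
file.  The function names are hypothetical, and treating "same order of
magnitude" as "same floor(log10) of the condition number" is an assumption
made for this sketch, not a definition taken from the command.

```python
import math

# Condition numbers and variance-decomposition proportions, copied from the
# colldiag output shown in the Example section of this help file.
VARIATES = ["age", "weight", "runtime", "rstpulse", "runpulse", "maxpulse", "_cons"]
TABLE = [
    (1.0,     [0.0002, 0.0002, 0.0002, 0.0003, 0.0000, 0.0000, 0.0000]),
    (19.2909, [0.1463, 0.0104, 0.0252, 0.3906, 0.0000, 0.0000, 0.0022]),
    (21.5007, [0.1501, 0.2357, 0.1286, 0.0281, 0.0012, 0.0012, 0.0006]),
    (27.6212, [0.0319, 0.1831, 0.6090, 0.1903, 0.0015, 0.0012, 0.0064]),
    (33.8292, [0.1128, 0.4444, 0.1250, 0.3648, 0.0151, 0.0083, 0.0013]),
    (82.6376, [0.4966, 0.1033, 0.0975, 0.0203, 0.0695, 0.0056, 0.7997]),
    (196.786, [0.0621, 0.0228, 0.0146, 0.0057, 0.9128, 0.9836, 0.1898]),
]


def near_dependencies(table, n_star):
    """Rule 1: keep the rows whose condition number exceeds n*."""
    return [row for row in table if row[0] > n_star]


def competing_groups(rows):
    """Group rows by order of magnitude.

    Assumption for this sketch: two condition numbers compete when they
    share the same floor(log10(.)).
    """
    groups = {}
    for cond, props in rows:
        order = math.floor(math.log10(cond))
        groups.setdefault(order, []).append((cond, props))
    return [groups[order] for order in sorted(groups)]


def involved_variates(group, names, p_star):
    """Rule 2: sum proportions over the group and keep variates above p*."""
    totals = [sum(column) for column in zip(*(props for _, props in group))]
    return {name: round(total, 4)
            for name, total in zip(names, totals) if total > p_star}


def diagnose(table, names, n_star=30, p_star=0.50):
    """Return the number of near dependencies and the flagged variates
    for each group of competing condition numbers."""
    rows = near_dependencies(table, n_star)
    findings = [([cond for cond, _ in group],
                 involved_variates(group, names, p_star))
                for group in competing_groups(rows)]
    return len(rows), findings


count, findings = diagnose(TABLE, VARIATES)
print("near dependencies:", count)
for conds, flagged in findings:
    print(conds, flagged)
```

With n*=30 and p*=.50 this reports three near dependencies: one group formed
by the two competing condition numbers, flagging age, weight, and the
constant, and a separate group for the largest condition number, flagging
runpulse and maxpulse.  Changing n_star or p_star shows how sensitive the
conclusions are to the chosen thresholds.
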

Example
-------

We have data on men involved in a physical fitness course.  The purpose of
the study is to model the oxygen uptake rate by the age, weight, time to run
one-and-a-half miles, the heart rate while resting, heart rate while
running, and the maximum heart rate while running.

 . ^describe^

 Contains data from fitness.dta
   Obs:    31 (max= 50172)                     Fitness data
  Vars:     7 (max=    99)                     16 Nov 1994 15:47
 Width:    28 (max=   200)
   1. age       float  %9.0g
   2. weight    float  %9.0g
   3. oxy       float  %9.0g
   4. runtime   float  %9.0g
   5. rstpulse  float  %9.0g
   6. runpulse  float  %9.0g
   7. maxpulse  float  %9.0g
 Sorted by:

 . ^fit oxy age weight runtime rstpulse runpulse maxpulse^

  Source |       SS       df       MS                  Number of obs =      31
---------+------------------------------               F(  6,    24) =   22.43
   Model |  722.543528     6  120.423921               Prob > F      =  0.0000
Residual |  128.837947    24   5.3682478               R-squared     =  0.8487
---------+------------------------------               Adj R-squared =  0.8108
   Total |  851.381475    30  28.3793825               Root MSE      =  2.3169

------------------------------------------------------------------------------
     oxy |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |  -.2269738   .0998375     -2.273   0.032      -.4330282   -.0209194
  weight |  -.0741774   .0545932     -1.359   0.187      -.1868521    .0384974
 runtime |  -2.628653   .3845622     -6.835   0.000       -3.42235   -1.834955
rstpulse |  -.0215336   .0660543     -0.326   0.747      -.1578629    .1147957
runpulse |  -.3696278   .1198529     -3.084   0.005      -.6169921   -.1222634
maxpulse |   .3032171   .1364952      2.221   0.036       .0215049    .5849294
   _cons |   102.9345   12.40326      8.299   0.000       77.33541    128.5335
------------------------------------------------------------------------------

We are somewhat concerned that there may be a dependency among the pulse
variables and investigate this with the new diagnostic tool.

 . ^colldiag^

 Proportion of variance associated with the decomposition

    Cond |
  Number |      age   weight  runtime rstpulse runpulse maxpulse    _cons
---------+--------------------------------------------------------------------
       1 |   0.0002   0.0002   0.0002   0.0003   0.0000   0.0000   0.0000
 19.2909 |   0.1463   0.0104   0.0252   0.3906   0.0000   0.0000   0.0022
 21.5007 |   0.1501   0.2357   0.1286   0.0281   0.0012   0.0012   0.0006
 27.6212 |   0.0319   0.1831   0.6090   0.1903   0.0015   0.0012   0.0064
 33.8292 |   0.1128   0.4444   0.1250   0.3648   0.0151   0.0083   0.0013
 82.6376 |   0.4966   0.1033   0.0975   0.0203   0.0695   0.0056   0.7997
 196.786 |   0.0621   0.0228   0.0146   0.0057   0.9128   0.9836   0.1898

If we use 30 as our value for n* and .50 as our threshold for p*, then we
see that points 2 and 3 from above are exhibited in our output.

The competing dependency is for the condition numbers 33.8292 and 82.6376,
which are of the same order of magnitude and both exceed our threshold
value of 30.  Aggregating the variance-decomposition proportions, we note
that age (.1128+.4966=.6094), weight (.4444+.1033=.5477), and the constant
(.0013+.7997=.8010) are involved in a competing dependency.

We also note that we have a dominating dependency with a condition number
greater than 196 and involving the runpulse and maxpulse variables.

Since we have 3 near dependencies (3 condition numbers greater than n*=30),
we should be able to express 3 of our independent variables in terms of the
remaining 4.  How do we choose the variates for which to solve?  Beginning
with the largest condition number, we see that we should choose either
runpulse or maxpulse.  Since maxpulse has the remainder of its variance
determined in a more removed dependency, we can choose it as our first
dependent variable in the auxiliary regression.  Now, since we are not as
interested in the constant term, we may choose weight and age as our
remaining pivots.


Author
------

    James W. Hardin, Stata Corporation
    EMAIL: tech-support@@stata.com


See Also
--------

  Manual:  [5s] fit
 On-line:  @fit@, @vif@ if installed
     STB:  STB-24: sg32