# st: Interesting numerical accuracy/collinearity issue

From: "Schaffer, Mark E" | Subject: st: Interesting numerical accuracy/collinearity issue | Date: Tue, 11 Apr 2006 21:32:22 +0100

```<reposting - first time didn't get through>

Dear Statalisters:

I've run into a numerical accuracy/collinearity issue that I think might
be of interest.  It relates specifically to a built-in Stata command,
-ovtest-, but I think it raises general issues.

-ovtest- implements a version of the Ramsey RESET (sometimes called an
"omitted variables test").  The textbook description of this particular
version of the test is as follows:

1.   Estimate the equation using -regress-.
2.   Calculate the predicted values of the dependent variable, yhat.
3.   Create new variables which are yhat^2, yhat^3 and yhat^4.
4.   Re-estimate the original equation, including yhat2, yhat3 and yhat4
as regressors.
5.   Test yhat2, yhat3 and yhat4 for joint significance using an F test.

A large test statistic in (5) is evidence that the original equation is
misspecified.

In fact, implementing the test exactly as above does not always generate
output that matches that of -ovtest-.  What sometimes happens is that
yhat2, yhat3 and yhat4 are nearly collinear with the other regressors in
the regression of step (4), and a variable gets dropped.

What Stata's -ovtest- does to avoid this is to rescale yhat so that it
lies in the unit interval.  Call this step 2a:

2a.  sum yhat, meanonly;  replace yhat = (yhat-r(min))/(r(max)-r(min))

and in practice, this seems to eliminate collinearities.

What is curious is that the following alternative rescaling usually does
*not* eliminate the collinearities: first calculate yhat^2, yhat^3
and yhat^4, and *then* rescale each of these so that it lies in the unit
interval.  Call this step 3a:

3a.  sum yhat2, meanonly;  replace yhat2 = (yhat2-r(min))/(r(max)-r(min))
     sum yhat3, meanonly;  replace yhat3 = (yhat3-r(min))/(r(max)-r(min))
     sum yhat4, meanonly;  replace yhat4 = (yhat4-r(min))/(r(max)-r(min))

Below is an example.

Using steps 1-5 with no rescaling generates a collinearity, and -regress-
drops a variable in step 4.  -coldiag2- shows the condition number for
the regression in step 4 is huge: 7,454,604

Using steps 1-5 plus 3a also generates a collinearity, and -regress-
drops a variable in step 4.  -coldiag2- again shows the condition number
for the regression in step 4 is huge, though a bit smaller: 1,658,268

Using steps 1-5 plus 2a, which is Stata's -ovtest- procedure, does not
generate a collinearity, and in step 4 -regress- drops nothing.
-coldiag2- shows the condition number for the regression in step 4 is
much smaller, but still well above the rule of thumb that ">30 means
collinearity problems": 538

My first question - why does the Stata method "work"?

My second question - *does* the Stata method work?  Or does rescaling
followed by raising to the 2nd, 3rd and 4th power introduce numerical
inaccuracies that cause what is a "genuine" near-collinearity to
decrease so much that Stata's -regress- doesn't detect it?

Any ideas?  It's not because I'm using floats.  Doubles everywhere.

--Mark

***************** Example output ****************

. version 8.2

. version
version 8.2

. which ovtest
*! version 2.3.6  05sep2001

. which coldiag2
*! version 2.0, 01Dec2004, John_Hendrickx@yahoo.com

.
. use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear
(Wages of Very Young Men, Zvi Griliches, J.Pol.Ec. 1976)

.
. * Generate yhats
. qui regress lw s

. qui predict double yhat

. * yhatr=rescaled yhat
. sum yhat, meanonly

. qui gen double yhatr = (yhat-r(min))/(r(max)-r(min))

. qui gen double yhat2=yhat^2

. qui gen double yhat3=yhat^3

. qui gen double yhat4=yhat^4

. * yhatr2=rescaled, then ^2; similarly for yhatr3 and yhatr4
. qui gen double yhatr2=yhatr^2

. qui gen double yhatr3=yhatr^3

. qui gen double yhatr4=yhatr^4

. * yhat2r=yhat^2 and then rescaled; similarly for yhat3r and yhat4r
. sum yhat2, meanonly

. qui gen double yhat2r = (yhat2-r(min))/(r(max)-r(min))

. sum yhat3, meanonly

. qui gen double yhat3r = (yhat3-r(min))/(r(max)-r(min))

. sum yhat4, meanonly

. qui gen double yhat4r = (yhat4-r(min))/(r(max)-r(min))

. * Summarize variables
. sum lw s yhat*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          lw |       758    5.686739    .4289494      4.605      7.051
           s |       758    13.40501    2.231828          9         18
        yhat |       758    5.686739    .2156493   5.261107   6.130727
       yhatr |       758    .4894459    .2479809          0          1
       yhat2 |       758    32.38544    2.473995   27.67924   37.58581
-------------+--------------------------------------------------------
       yhat3 |       758    184.7003     21.3139   145.6234   230.4284
       yhat4 |       758    1054.929    163.4247   766.1405   1412.693
      yhatr2 |       758    .3009707    .2769761          0          1
      yhatr3 |       758    .2142687    .2681644          0          1
      yhatr4 |       758    .1671979    .2530434          0          1
-------------+--------------------------------------------------------
      yhat2r |       758    .4750582    .2497327          0          1
      yhat3r |       758    .4607848    .2513286          0          1
      yhat4r |       758    .4466593     .252763          0          1

.
. * Quadratic form of RESET
. * 1.  Unrescaled RESET
. * Collinearity appears
. qui regress lw s  yhat2 yhat3 yhat4

. testparm yhat2 yhat3 yhat4

( 1)  yhat2 = 0
( 2)  yhat3 = 0
( 3)  yhat4 = 0
Constraint 1 dropped

F(  2,   754) =    0.87
Prob > F =    0.4191

. * 2.  yhat that is first ^2, ^3, ^4, then rescaled
. * Collinearity appears
. qui regress lw s  yhat2r yhat3r yhat4r

. testparm yhat2r yhat3r yhat4r

( 1)  yhat2r = 0
( 2)  yhat3r = 0
( 3)  yhat4r = 0
Constraint 1 dropped

F(  2,   754) =    0.87
Prob > F =    0.4191

. * 3.  yhat that is first rescaled, then ^2, ^3, ^4
. * No collinearity
. qui regress lw s  yhatr2 yhatr3 yhatr4

. testparm yhatr2 yhatr3 yhatr4

( 1)  yhatr2 = 0
( 2)  yhatr3 = 0
( 3)  yhatr4 = 0

F(  3,   753) =    0.59
Prob > F =    0.6216

. * 4.  Stata's built-in ovtest
. *     Matches first-rescaled-then-powered, i.e., (3)
. qui regress lw s

. ovtest

Ramsey RESET test using powers of the fitted values of lw
Ho:  model has no omitted variables
F(3, 753) =      0.59
Prob > F =      0.6216

.
. * Collinearities
. _rmcoll s  yhat2 yhat3 yhat4
note: yhat2 dropped due to collinearity

. _rmcoll s  yhat2r yhat3r yhat4r
note: yhat2r dropped due to collinearity

. _rmcoll s  yhatr2 yhatr3 yhatr4

.
. * -coldiag2-
. coldiag2 s  yhat2 yhat3 yhat4

Condition number using scaled variables =   7454604.11

Condition Indexes and Variance-Decomposition Proportions

    condition
        index  _cons      s  yhat2  yhat3  yhat4

1        1.00   0.00   0.00   0.00   0.00   0.00
2       16.73   0.00   0.00   0.00   0.00   0.00
3      359.37   0.00   0.00   0.00   0.00   0.00
4    33354.15   0.00   0.00   0.00   0.00   0.00
5  7454604.11   1.00   1.00   1.00   1.00   1.00

. coldiag2 s  yhat2r yhat3r yhat4r

Condition number using scaled variables =   1658268.08

Condition Indexes and Variance-Decomposition Proportions

    condition
        index  _cons      s yhat2r yhat3r yhat4r

1        1.00   0.00   0.00   0.00   0.00   0.00
2        4.69   0.00   0.00   0.00   0.00   0.00
3      168.37   0.00   0.00   0.00   0.00   0.00
4    12145.95   0.00   0.00   0.00   0.00   0.00
5  1658268.08   1.00   1.00   1.00   1.00   1.00

. coldiag2 s  yhatr2 yhatr3 yhatr4

Condition number using scaled variables =       538.15

Condition Indexes and Variance-Decomposition Proportions

    condition
        index  _cons      s yhatr2 yhatr3 yhatr4

1        1.00   0.00   0.00   0.00   0.00   0.00
2        2.38   0.00   0.00   0.00   0.00   0.00
3       14.57   0.00   0.00   0.00   0.00   0.00
4       96.54   0.08   0.06   0.00   0.02   0.05
5      538.15   0.92   0.94   1.00   0.98   0.94

*********** do file to generate output **************

version 8.2
version
which ovtest
which coldiag2

use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear

* Generate yhats
qui regress lw s
qui predict double yhat
* yhatr=rescaled yhat
sum yhat, meanonly
qui gen double yhatr = (yhat-r(min))/(r(max)-r(min))
qui gen double yhat2=yhat^2
qui gen double yhat3=yhat^3
qui gen double yhat4=yhat^4
* yhatr2=rescaled, then ^2; similarly for yhatr3 and yhatr4
qui gen double yhatr2=yhatr^2
qui gen double yhatr3=yhatr^3
qui gen double yhatr4=yhatr^4
* yhat2r=yhat^2 and then rescaled; similarly for yhat3r and yhat4r
sum yhat2, meanonly
qui gen double yhat2r = (yhat2-r(min))/(r(max)-r(min))
sum yhat3, meanonly
qui gen double yhat3r = (yhat3-r(min))/(r(max)-r(min))
sum yhat4, meanonly
qui gen double yhat4r = (yhat4-r(min))/(r(max)-r(min))
* Summarize variables
sum lw s yhat*

* 1.  Unrescaled RESET
* Collinearity appears
qui regress lw s  yhat2 yhat3 yhat4
testparm yhat2 yhat3 yhat4
* 2.  yhat that is first ^2, ^3, ^4, then rescaled
* Collinearity appears
qui regress lw s  yhat2r yhat3r yhat4r
testparm yhat2r yhat3r yhat4r
* 3.  yhat that is first rescaled, then ^2, ^3, ^4
* No collinearity
qui regress lw s  yhatr2 yhatr3 yhatr4
testparm yhatr2 yhatr3 yhatr4
* 4.  Stata's built-in ovtest
*     Matches first-rescaled-then-powered, i.e., (3)
qui regress lw s
ovtest

* Collinearities
_rmcoll s  yhat2 yhat3 yhat4
_rmcoll s  yhat2r yhat3r yhat4r
_rmcoll s  yhatr2 yhatr3 yhatr4

* -coldiag2-
coldiag2 s  yhat2 yhat3 yhat4
coldiag2 s  yhat2r yhat3r yhat4r
coldiag2 s  yhatr2 yhatr3 yhatr4

```
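
A note on what the three constructions in the example have in common, offered as a sketch under exact arithmetic and not as a description of what -ovtest- or -regress- do internally. The symbols below are introduced here for illustration and do not appear in the post: a and b are the intercept and slope from -regress lw s-, m and M are the minimum and maximum of yhat, c = M - m, and m_k and c_k are the minimum and range of yhat^k. The only assumptions are that the regression includes a constant and that b is nonzero. Since yhat is affine in s and every rescaling is affine, each set of powers can be written in terms of the others:

```latex
\begin{align*}
\hat y &= a + b\,s, \qquad
\tilde y = \frac{\hat y - m}{c}, \qquad c = M - m, \\
\tilde y^{\,k} &= c^{-k}\sum_{j=0}^{k}\binom{k}{j}(-m)^{k-j}\,\hat y^{\,j},
\qquad
\hat y^{\,k} = \sum_{j=0}^{k}\binom{k}{j}\,m^{k-j}\,c^{\,j}\,\tilde y^{\,j},
\qquad k = 2, 3, 4, \\
\frac{\hat y^{\,k} - m_k}{c_k} &\in \operatorname{span}\{1,\ \hat y^{\,k}\}, \\
\operatorname{span}\{1, s, \hat y^{2}, \hat y^{3}, \hat y^{4}\}
 &= \operatorname{span}\{1, s, \tilde y^{2}, \tilde y^{3}, \tilde y^{4}\}
 = \operatorname{span}\Bigl\{1, s,
   \tfrac{\hat y^{2} - m_2}{c_2},
   \tfrac{\hat y^{3} - m_3}{c_3},
   \tfrac{\hat y^{4} - m_4}{c_4}\Bigr\}.
\end{align*}
```

Read this way, the three regressions in the example share one column space. In exact arithmetic, and provided s takes at least five distinct values so that the columns are linearly independent, they would give the same fitted values and the same three-restriction F statistic. The F(2, 754) results then arise because a variable was dropped, which suggests finite-precision conditioning and not an exact linear dependence. The summary table is consistent with this: yhat2r, yhat3r and yhat4r have means and standard deviations close to those of yhatr (roughly .45 to .49 and .25), as one would expect if powers of a variable confined to [5.26, 6.13] are almost linear in it, whereas yhatr2, yhatr3 and yhatr4 differ visibly from yhatr. Whether this fully accounts for the behaviour of -regress- is not something the post settles.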