# Re: st: Interesting numerical accuracy/collinearity issue

| | |
|---|---|
| From | jpitblado@stata.com (Jeff Pitblado, StataCorp LP) |
| To | statalist@hsphsun2.harvard.edu |
| Subject | Re: st: Interesting numerical accuracy/collinearity issue |
| Date | Wed, 12 Apr 2006 13:37:55 -0500 |

```
Mark Schaffer <M.E.Schaffer@hw.ac.uk> is concerned about collinearity and
-ovtest-:

> -ovtest- implements a version of the Ramsey RESET (sometimes called an
> "omitted variables test").  The textbook description of this particular
> version of the test is as follows:
>
> 1.   Estimate the equation using -regress-.
> 2.   Calculate the predicted values of the dependent variable, yhat.
> 3.   Create new variables which are yhat^2, yhat^3 and yhat^4.
> 4.   Re-estimate the original equation, including yhat2, yhat3 and yhat4
> as regressors.
> 5.   Test yhat2, yhat3 and yhat4 for joint significance using an F test.
>
> A large test statistic in (5) is evidence that the original equation is
> misspecified.
>
> In fact, implementing the test exactly as above does not always generate
> output that matches that of -ovtest-.  What sometimes happens is that
> yhat2, yhat3 and yhat4 are nearly collinear with the other regressors in
> step (5), and a variable gets dropped.
>
> What Stata's -ovtest- does to avoid this is to rescale yhat so that it
> lies in the unit interval.  Call this step 2a:
>
> 2a.  sum yhat, meanonly;  replace yhat = (yhat-r(min))/(r(max)-r(min))
>
> and in practice, this seems to eliminate collinearities.
>
> What is curious is that the following alternative rescaling usually does
> *not* eliminate the collinearities, namely first calculate yhat^2, yhat^3
> and yhat^4, and *then* rescale these so that they lie in the unit
> interval.  Call this step 3a:
>
> 3a.  sum yhat2, meanonly;  replace yhat2 = (yhat2-r(min))/(r(max)-r(min))
>      sum yhat3, meanonly;  replace yhat3 = (yhat3-r(min))/(r(max)-r(min))
>      sum yhat4, meanonly;  replace yhat4 = (yhat4-r(min))/(r(max)-r(min))
>
> Below is an example.
>
> Using steps 1-5 with no rescaling generates a collinearity and -regress-
> drops a variable in step 5.  -coldiag2- shows the condition number for
> the regression in step 5 is huge: 7,454,604
>
> Using steps 1-5 plus 3a also generates a collinearity, and -regress-
> drops a variable in step 5.  -coldiag2- again shows the condition number
> for the regression in step 5 is huge, though a bit smaller: 1,658,268
>
> Using steps 1-5 plus 2a, which is Stata's -ovtest- procedure, does not
> generate a collinearity, and in step 5 -regress- drops nothing.
> -coldiag2- shows the condition number for the regression in step 5 is
> much smaller, but still way above the rule of thumb that ">30 means
> collinearity problems": 538
>
> My first question - why does the Stata method "work"?
>
> My second question - *does* the Stata method work?  Or does rescaling
> followed by raising to the 2nd, 3rd and 4th power introduce numerical
> inaccuracies that cause what is a "genuine" near-collinearity to
> decrease so much that Stata's -regress- doesn't detect it?
>
> Any ideas?  It's not because I'm using floats.  Doubles everywhere.
>
> (example omitted)

Mark's two questions, paraphrased:

1.  Why does generating the powered terms from 'yhatr' (the centered
and rescaled 'yhat') circumvent the problem of collinearity, while
generating the powers before centering and rescaling does not?

2.  Is Stata's -ovtest- producing valid results when the direct
application of the method would otherwise result in a collinearity
issue?

I'll answer the second question first.

Yes, Stata's -ovtest- is producing valid results when you would otherwise have
a problem with collinearity.  As shown in the discussion below, a little
algebra will show that the F test from -ovtest- is statistically equivalent to
that of the direct approach.  Collinearity is a numerical nuisance here, since
yhat^2, yhat^3, and yhat^4 are not mathematically collinear.

Now for question 1:

In the direct approach we have the following regression equation (x stands
for the original regressors):

(1)	y = b0 + b1 x + b2 yhat^2 + b3 yhat^3 + b4 yhat^4

but -ovtest- fits the following regression

(2)	y = c0 + c1 x + c2 yhatr2 + c3 yhatr3 + c4 yhatr4

with

yhatr2 = yhatr^2 = { (yhat - m)/r }^2

and similarly for yhatr3 and yhatr4, where m is the mean of yhat and r is its
range.  After a little algebra, we see that these two regressions are merely
reparameterizations of each other, since 'm' and 'r' are fixed.
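
Spelling out that algebra (a sketch in the notation above, assuming only that
yhat comes from the step 1 regression of y on a constant and x, and that r is
nonzero): expanding the powers of yhatr gives

    yhatr^2 = ( yhat^2 - 2 m yhat + m^2 ) / r^2
    yhatr^3 = ( yhat^3 - 3 m yhat^2 + 3 m^2 yhat - m^3 ) / r^3
    yhatr^4 = ( yhat^4 - 4 m yhat^3 + 6 m^2 yhat^2 - 4 m^3 yhat + m^4 ) / r^4

Every term on the right is a multiple of 1, yhat, yhat^2, yhat^3, or yhat^4,
and yhat is itself a linear combination of the constant and x.  The leading
coefficients 1/r^2, 1/r^3, and 1/r^4 are nonzero, so the relations can be
inverted, and {1, x, yhatr2, yhatr3, yhatr4} spans the same space as
{1, x, yhat^2, yhat^3, yhat^4}.  Both regressions therefore have the same
residual sum of squares, and the F test of c2 = c3 = c4 = 0 in (2) is the
F test of b2 = b3 = b4 = 0 in (1).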

Now suppose that we are having collinearity problems with (1) and apply
Mark's 3a approach.  This would result in the following regression:

(3)	y = d0 + d1 x + d2 yhat2r + d3 yhat3r + d4 yhat4r

with

yhat2r = (yhat^2 - m2)/r2

and similarly for yhat3r and yhat4r, where m2 is the mean of yhat^2 and r2 is
its range.  Notice that if there is a collinearity among x, yhat^2, yhat^3,
and yhat^4, you will necessarily have one among x, yhat2r, yhat3r, and
yhat4r.

yhat2r, yhat3r, and yhat4r are simple linear transformations of
yhat^2, yhat^3, and yhat^4, and linear transformations preserve
collinearity.

Thus Mark's 3a does nothing to affect the collinear relationship among the
powers of yhat.  Stata's -ovtest- avoids the collinearity problem by shifting
and scaling yhat into a region where its second, third, and fourth powers are
no longer numerically collinear (given at least 4 observations).
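
The three regressions can be compared side by side with a short do-file.
This is only a sketch: the auto dataset and the model of price on mpg and
weight are stand-ins chosen for illustration (Mark's example is not
reproduced here), and m is taken to be the mean of yhat as in (2), whereas
Mark's step 2a subtracts r(min); either choice is a fixed shift.  Whether
-regress- drops a term in (1) or (3) depends on the data.

    * step 1 and 2: fit the model and save the predictions in double precision
    sysuse auto, clear
    regress price mpg weight
    predict double yhat

    * equation (1): raw powers of yhat
    generate double yhat2 = yhat^2
    generate double yhat3 = yhat^3
    generate double yhat4 = yhat^4
    regress price mpg weight yhat2 yhat3 yhat4

    * equation (2): shift and scale yhat first, then take powers
    summarize yhat, meanonly
    generate double yhatr  = (yhat - r(mean))/(r(max) - r(min))
    generate double yhatr2 = yhatr^2
    generate double yhatr3 = yhatr^3
    generate double yhatr4 = yhatr^4
    regress price mpg weight yhatr2 yhatr3 yhatr4
    test yhatr2 yhatr3 yhatr4

    * equation (3): Mark's 3a, shift and scale each raw power separately
    foreach v of varlist yhat2 yhat3 yhat4 {
        summarize `v', meanonly
        generate double `v'r = (`v' - r(mean))/(r(max) - r(min))
    }
    regress price mpg weight yhat2r yhat3r yhat4r

    * built-in test for comparison, run after the original regression
    quietly regress price mpg weight
    estat ovtest

If no term is dropped in any of the fits, the joint F statistic for the
three added terms should agree across (1), (2), and (3) up to rounding,
since the three sets of regressors span the same space.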

--Jeff