Statalist The Stata Listserver



Re: st: Interesting numerical accuracy/collinearity issue


From   jpitblado@stata.com (Jeff Pitblado, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Interesting numerical accuracy/collinearity issue
Date   Wed, 12 Apr 2006 13:37:55 -0500

Mark Schaffer <M.E.Schaffer@hw.ac.uk> is concerned about collinearity and the
-ovtest-:

> -ovtest- implements a version of the Ramsey RESET (sometimes called an
> "omitted variables test").  The textbook description of this particular
> version of the test is as follows:
> 
> 1.   Estimate the equation using -regress-.
> 2.   Calculate the predicted values of the dependent variable, yhat.
> 3.   Create new variables which are yhat^2, yhat^3 and yhat^4.
> 4.   Re-estimate the original equation, including yhat2, yhat3 and yhat4
> as regressors.
> 5.   Test yhat2, yhat3 and yhat4 for joint significance using an F test.
> 
> A large test statistic in (5) is evidence that the original equation is
> misspecified.
> 
> In fact, implementing the test exactly as above does not always generate
> output that matches that of -ovtest-.  What sometimes happens is that
> yhat2, yhat3 and yhat4 are nearly collinear with the other regressors in
> step (5), and a variable gets dropped.
> 
> What Stata's -ovtest- does to avoid this is to rescale yhat so that it
> lies in the unit interval.  Call this step 2a:
> 
> 2a.  sum yhat, meanonly;  replace yhat = (yhat-r(min))/(r(max)-r(min))
> 
> and in practice, this seems to eliminate collinearities.
> 
> What is curious is that the following alternative rescaling usually does
> *not* eliminate the collinearities, namely first calculate yhat^2, yhat^3
> and yhat^4, and *then* rescale these so that they lie in the unit
> interval.  Call this step 3a:
> 
> 3a.  sum yhat2, meanonly;  replace yhat2 =
> (yhat2-r(min))/(r(max)-r(min))
>      sum yhat3, meanonly;  replace yhat3 =
> (yhat3-r(min))/(r(max)-r(min))
>      sum yhat4, meanonly;  replace yhat4 =
> (yhat4-r(min))/(r(max)-r(min))
> 
> Below is an example.
> 
> Using steps 1-5 with no rescaling generates a collinearity and -regress-
> drops a variable in step 5.  -coldiag2- shows the condition number for
> the regression in step 5 is huge: 7,454,604
> 
> Using steps 1-5 plus 3a also generates a collinearity, and -regress-
> drops a variable in step 5.  -coldiag2- again shows the condition number
> for the regression in step 5 is huge, though a bit smaller: 1,658,268
> 
> Using steps 1-5 plus 2a, which is Stata's -ovtest- procedure, does not
> generate a collinearity, and in step 5 -regress- drops nothing.
> -coldiag2- shows the condition number for the regression in step 5 is
> much smaller, but still way above the rule of thumb that ">30 means
> collinearity problems": 538
> 
> My first question - why does the Stata method "work"?
> 
> My second question - *does* the Stata method work?  Or does rescaling
> followed by raising to the 2nd, 3rd and 4th power introduce numerical
> inaccuracies that cause what is a "genuine" near-collinearity to
> decrease so much that Stata's -regress- doesn't detect it?
> 
> Any ideas?  It's not because I'm using floats.  Doubles everywhere.
>
> (example omitted)

To sum up, Mark asks

	1.  Why does generating the powered terms from 'yhatr' (the shifted
	    and rescaled 'yhat') circumvent the problem of collinearity, when
	    generating the powers before shifting and rescaling doesn't?

	2.  Is Stata's -ovtest- producing valid results when the direct
	    application of the method would otherwise result in a collinearity
	    issue?

I'll answer the second question first.

Yes, Stata's -ovtest- is producing valid results when you would otherwise have
a problem with collinearity.  As the discussion below shows, a little algebra
establishes that the F test from -ovtest- is statistically equivalent to that
of the direct approach.  Collinearity is a numerical nuisance here, since x^2,
x^3, and x^4 are not mathematically collinear.
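
As a rough check of that equivalence, the sketch below runs the built-in test
and then repeats steps 1-5 by hand with Mark's step 2a.  The auto dataset, the
model, and the names yhat, yhatr, yhatr2-yhatr4, and F_ovtest are illustrative
assumptions of mine; this is not Mark's omitted example.

```stata
* Sketch only: dataset and model are stand-ins, not Mark's example.
sysuse auto, clear
regress price mpg weight
estat ovtest                          // built-in RESET
scalar F_ovtest = r(F)

* Manual version with step 2a: rescale yhat first, then take powers.
predict double yhat if e(sample), xb
summarize yhat, meanonly              // meanonly still stores r(min), r(max)
generate double yhatr  = (yhat - r(min)) / (r(max) - r(min))
generate double yhatr2 = yhatr^2
generate double yhatr3 = yhatr^3
generate double yhatr4 = yhatr^4
regress price mpg weight yhatr2 yhatr3 yhatr4
test yhatr2 yhatr3 yhatr4             // joint F test of the added terms
display "ovtest F = " F_ovtest "   manual F = " r(F)
```

If nothing is dropped in the augmented regression, the two F statistics
should agree up to rounding.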

Now for question 1:

In the direct approach we have the following regression equation:

(1)	y = b0 + b1 x + b2 yhat^2 + b3 yhat^3 + b4 yhat^4

but -ovtest- fits the following regression

(2)	y = c0 + c1 x + c2 yhatr2 + c3 yhatr3 + c4 yhatr4

with

	yhatr2 = yhatr^2 = { (yhat - m)/r }^2

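and similarly for yhatr3 and yhatr4, where m is the shift applied to yhat (its
minimum in Mark's step 2a) and r is its range.  Expanding gives
yhatr^2 = (yhat^2 - 2 m yhat + m^2)/r^2, and yhat is itself a linear
combination of the constant and x, so yhatr2 lies in the span of the constant,
x, and yhat^2; the higher powers expand the same way.  Since 'm' and 'r' are
fixed, these two regressions are merely reparameterizations of each other.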

Now suppose that we are having collinearity problems with (1), but applied
Mark's 3a approach.  This would result in the following regression:

(3)	y = d0 + d1 x + d2 yhat2r + d3 yhat3r + d4 yhat4r

with

	yhat2r = (yhat^2 - m2)/r2

and similarly for yhat3r and yhat4r, where m2 is the shift applied to yhat^2
(its minimum in Mark's step 3a) and r2 is its range.  Notice that if there is
a collinearity between x, yhat^2, yhat^3, and yhat^4, you will necessarily
have one between x, yhat2r, yhat3r, and yhat4r.

	yhat2r, yhat3r, and yhat4r are simple linear transformations of
	yhat^2, yhat^3, and yhat^4; linear transformations preserve
	collinearity

Thus Mark's 3a does nothing to affect the collinear relationship among the
powers of yhat.  Stata's -ovtest- avoids the collinearity problem by shifting
and scaling yhat into a region where its second, third, and fourth powers are
no longer numerically collinear (given at least 4 observations).
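
The three variants discussed above can be compared side by side with a sketch
like the following.  Again the auto dataset, the model, and all variable names
are assumptions for illustration only; e(df_m) is used as a quick indicator of
whether -regress- dropped a term.

```stata
* Sketch only: dataset and model are stand-ins, not Mark's example.
sysuse auto, clear
regress price mpg weight
predict double yhat if e(sample), xb

* (1) raw powers of yhat
forvalues k = 2/4 {
    generate double yhat`k' = yhat^`k'
}
regress price mpg weight yhat2 yhat3 yhat4
display "raw powers:           model df = " e(df_m)

* (3) Mark's step 3a: rescale each power after it is formed
forvalues k = 2/4 {
    summarize yhat`k', meanonly
    generate double yhat`k'r = (yhat`k' - r(min)) / (r(max) - r(min))
}
regress price mpg weight yhat2r yhat3r yhat4r
display "3a (rescale powers):  model df = " e(df_m)

* (2) step 2a: rescale yhat first, then form the powers
summarize yhat, meanonly
generate double yhatr = (yhat - r(min)) / (r(max) - r(min))
forvalues k = 2/4 {
    generate double yhatr`k' = yhatr^`k'
}
regress price mpg weight yhatr2 yhatr3 yhatr4
display "2a (rescale yhat):    model df = " e(df_m)
```

With two original regressors and three added terms, a model df below 5 means
a term was dropped.  Whether (1) and (3) actually drop one depends on the
data at hand.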

--Jeff
jpitblado@stata.com
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


