Statalist The Stata Listserver



RE: st: Interesting numerical accuracy/collinearity issue


From   "Schaffer, Mark E" <M.E.Schaffer@hw.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Interesting numerical accuracy/collinearity issue
Date   Wed, 12 Apr 2006 20:46:54 +0100

Thanks, Jeff, that's very helpful, but can I ask a follow-up?  (See below.)

> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu 
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of 
> Jeff Pitblado, StataCorp LP
> Sent: 12 April 2006 19:38
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Interesting numerical accuracy/collinearity issue
> 
> Mark Schaffer <M.E.Schaffer@hw.ac.uk> is concerned about 
> collinearity and the
> -ovtest-:
> 
> > -ovtest- implements a version of the Ramsey RESET (sometimes called an 
> > "omitted variables test").  The textbook description of this 
> > particular version of the test is as follows:
> > 
> > 1.   Estimate the equation using -regress-.
> > 2.   Calculate the predicted values of the dependent variable, yhat.
> > 3.   Create new variables which are yhat^2, yhat^3 and yhat^4.
> > 4.   Re-estimate the original equation, including yhat2, yhat3 and yhat4
> > as regressors.
> > 5.   Test yhat2, yhat3 and yhat4 for joint significance using an F test.
> > 
> > A large test statistic in (5) is evidence that the original equation 
> > is misspecified.
> > 
> > In fact, implementing the test exactly as above does not always 
> > generate output that matches that of -ovtest-.  What sometimes happens 
> > is that yhat2, yhat3 and yhat4 are nearly collinear with the other 
> > regressors in step (5), and a variable gets dropped.
> > 
> > What Stata's -ovtest- does to avoid this is to rescale yhat so that it 
> > lies in the unit interval.  Call this step 2a:
> > 
> > 2a.  sum yhat, meanonly;  replace yhat =  (yhat-r(min))/(r(max)-r(min))
> > 
> > and in practice, this seems to eliminate collinearities.
> > 
> > What is curious is that the following alternative rescaling usually does
> > *not* eliminate the collinearities, namely first calculate yhat^2, 
> > yhat^3 and yhat^4, and *then* rescale these so that they lie in the 
> > unit interval.  Call this step 3a:
> > 
> > 3a.  sum yhat2, meanonly;  replace yhat2 = (yhat2-r(min))/(r(max)-r(min))
> >      sum yhat3, meanonly;  replace yhat3 = (yhat3-r(min))/(r(max)-r(min))
> >      sum yhat4, meanonly;  replace yhat4 = (yhat4-r(min))/(r(max)-r(min))
> > 
> > Below is an example.
> > 
> > Using steps 1-5 with no rescaling generates a collinearity and 
> > -regress- drops a variable in step 5.  -coldiag2- shows the condition 
> > number for the regression in step 5 is huge: 7,454,604
> > 
> > Using steps 1-5 plus 3a also generates a collinearity, and -regress- 
> > drops a variable in step 5.  -coldiag2- again shows the condition 
> > number for the regression in step 5 is huge, though a bit smaller: 
> > 1,658,268
> > 
> > Using steps 1-5 plus 2a, which is Stata's -ovtest- procedure, does not 
> > generate a collinearity, and in step 5 -regress- drops nothing.
> > -coldiag2- shows the condition number for the regression in step 5 is 
> > much smaller, but still way above the rule of thumb that ">30 means 
> > collinearity problems": 538
> > 
> > My first question - why does the Stata method "work"?
> > 
> > My second question - *does* the Stata method work?  Or does rescaling 
> > followed by raising to the 2nd, 3rd and 4th power introduce numerical 
> > inaccuracies that cause what is a "genuine" near-collinearity to 
> > decrease so much that Stata's -regress- doesn't detect it?
> > 
> > Any ideas?  It's not because I'm using floats.  Doubles everywhere.
> >
> > (example omitted)
> 
> To sum up, Mark asks 
> 
> 	1.  Why does generating the powered terms from 'yhatr' (the centered
> 	    and rescaled 'yhat') circumvent the problem of collinearity, but
> 	    generating the powers before centering and rescaling doesn't?
> 
> 	2.  Is Stata's -ovtest- producing valid results when the direct
> 	    application of the method would otherwise result in a collinearity
> 	    issue?
> 
> I'll answer the second question first.
> 
> Yes, Stata's -ovtest- is producing valid results when you 
> would otherwise have a problem with collinearity.  As pointed 
> out in the following discussion, a little algebra will show 
> that the F test from -ovtest- is statistically equivalent to 
> that of the direct approach.  Collinearity is a numerical 
> nuisance here, since x^2, x^3, and x^4 are not mathematically 
> collinear.
>
> Now for question 1:
> 
> In the direct approach we have the following regression equation:
> 
> (1)	y = b0 + b1 x + b2 yhat^2 + b3 yhat^3 + b4 yhat^4
> 
> but -ovtest- fits the following regression
> 
> (2)	y = c0 + c1 x + c2 yhatr2 + c3 yhatr3 + c4 yhatr4
> 
> with
> 
> 	yhatr2 = yhatr^2 = { (yhat - m)/r }^2
> 
> and similarly for yhatr3 and yhatr4; where m is the mean of 
> yhat and r is its range.  After a little algebra, we see that 
> these two regressions are merely reparameterizations of each 
> other since 'm' and 'r' are fixed.
> 
> Now suppose that we are having collinearity problems with 
> (1), but applied Mark's 3a approach.  This would result in 
> the following regression:
> 
> (3)	y = d0 + d1 x + d2 yhat2r + d3 yhat3r + d4 yhat4r
> 
> with
> 
> 	yhat2r = (yhat^2 - m2)/r2
> 
> and similarly for yhat3r and yhat4r; where m2 is the mean of 
> yhat^2, and r2 is its range.  Notice that if there is a 
> collinearity among x, yhat^2, yhat^3, and yhat^4, you will 
> necessarily have one among x, yhat2r, yhat3r, and yhat4r.
> 
> 	yhat2r, yhat3r, and yhat4r are simple linear transformations of
> 	yhat^2, yhat^3, and yhat^4; linear transformations preserve
> 	collinearity
> 
> Thus Mark's 3a does nothing to affect the collinear 
> relationship among the powers of yhat.  Stata's -ovtest- 
> avoids the collinearity problem by shifting and scaling yhat 
> into a region where its second, third, and fourth powers are 
> no longer numerically collinear (given at least 4 observations).

My follow-up question is simple: why does the shifting and scaling used by Stata's -ovtest- introduce greater accuracy rather than, say, greater rounding error?  (Either accuracy or error would remove the numerical collinearity.)  The algebra doesn't help me here, since all three methods are algebraically equivalent.  I'm guessing that there's probably a general principle about how best to maintain numerical precision, but I don't know what it might be.
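
For anyone wanting to try the comparison, the three variants discussed above (no rescaling, step 3a, step 2a) might be sketched roughly as follows.  This is only an illustration: the auto dataset and the model of price on mpg and weight are assumptions made here, not the example omitted from the quoted message, and whether -regress- drops a term in any given variant depends on the data.

```stata
* Illustrative data and model only (an assumption, not the omitted example)
sysuse auto, clear
regress price mpg weight
predict double yhat, xb

* No rescaling: raw powers of yhat (steps 3-5)
forvalues p = 2/4 {
    generate double yhat`p' = yhat^`p'
}
regress price mpg weight yhat2 yhat3 yhat4
* -test- may complain or drop a constraint if -regress- omitted a term
test yhat2 yhat3 yhat4

* Step 3a: take powers first, then rescale each power to the unit interval
forvalues p = 2/4 {
    summarize yhat`p', meanonly
    generate double yhat`p'r = (yhat`p' - r(min)) / (r(max) - r(min))
}
regress price mpg weight yhat2r yhat3r yhat4r
test yhat2r yhat3r yhat4r

* Step 2a: rescale yhat to the unit interval first, then take powers
summarize yhat, meanonly
generate double yhatr = (yhat - r(min)) / (r(max) - r(min))
forvalues p = 2/4 {
    generate double yhatr`p' = yhatr^`p'
}
regress price mpg weight yhatr2 yhatr3 yhatr4
test yhatr2 yhatr3 yhatr4
```

Checking each -regress- output for a dropped term, and comparing the three F statistics, is one way to see which variants behave alike on a given dataset.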

--Mark


> --Jeff
> jpitblado@stata.com
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 
> 



