st: Interesting numerical accuracy/collinearity issue

From	"Schaffer, Mark E" <[email protected]>
To	<[email protected]>
Subject	st: Interesting numerical accuracy/collinearity issue
Date	Tue, 11 Apr 2006 21:32:22 +0100
<reposting - first time didn't get through> 

Dear Statalisters:

I've run into a numerical accuracy/collinearity issue that I think might
be of interest.  It relates specifically to a built-in Stata command,
-ovtest-, but I think it raises general issues.

-ovtest- implements a version of the Ramsey RESET (sometimes called an
"omitted variables test").  The textbook description of this particular
version of the test is as follows:

1.   Estimate the equation using -regress-.
2.   Calculate the predicted values of the dependent variable, yhat.
3.   Create new variables which are yhat^2, yhat^3 and yhat^4.
4.   Re-estimate the original equation, including yhat2, yhat3 and yhat4
as regressors.
5.   Test yhat2, yhat3 and yhat4 for joint significance using an F test.

A large test statistic in (5) is evidence that the original equation is
misspecified.

In fact, implementing the test exactly as above does not always generate
output that matches that of -ovtest-.  What sometimes happens is that
yhat2, yhat3 and yhat4 are nearly collinear with the other regressors in
step (4), and a variable gets dropped.

What Stata's -ovtest- does to avoid this is to rescale yhat so that it
lies in the unit interval.  Call this step 2a:

2a.  sum yhat, meanonly;  replace yhat = (yhat-r(min))/(r(max)-r(min))

and in practice, this seems to eliminate collinearities.

What is curious is that the following alternative rescaling usually does
*not* eliminate the collinearities: first calculate yhat^2, yhat^3
and yhat^4, and *then* rescale these so that they lie in the unit
interval.  Call this step 3a:

3a.  sum yhat2, meanonly;  replace yhat2 = (yhat2-r(min))/(r(max)-r(min))
     sum yhat3, meanonly;  replace yhat3 = (yhat3-r(min))/(r(max)-r(min))
     sum yhat4, meanonly;  replace yhat4 = (yhat4-r(min))/(r(max)-r(min))
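
A short derivation may help frame why 3a would not be expected to help,
at least in exact arithmetic.  Writing m_k and M_k for the sample minimum
and maximum of yhat^k (notation introduced here, not part of the steps
above) and X for the original regressors, step 3a is an affine map of
each power:

```latex
\tilde{y}_k \;=\; \frac{\hat{y}^k - m_k}{M_k - m_k} \;=\; a_k + b_k\,\hat{y}^k,
\qquad a_k = -\frac{m_k}{M_k - m_k},\quad b_k = \frac{1}{M_k - m_k},\quad k = 2,3,4,

\operatorname{span}\{1,\, X,\, \tilde{y}_2,\, \tilde{y}_3,\, \tilde{y}_4\}
\;=\;
\operatorname{span}\{1,\, X,\, \hat{y}^2,\, \hat{y}^3,\, \hat{y}^4\}.
```

Because the regression includes a constant, 3a changes the units of the
added regressors but not the space they span, so any near-collinearity
among the unrescaled powers carries over.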

Below is an example.

Using steps 1-5 with no rescaling generates a collinearity, and -regress-
drops a variable in step 4.  -coldiag2- shows the condition number for
the regression in step 4 is huge: 7,454,604

Using steps 1-5 plus 3a also generates a collinearity, and -regress-
drops a variable in step 4.  -coldiag2- again shows the condition number
for the regression in step 4 is huge, though a bit smaller: 1,658,268

Using steps 1-5 plus 2a, which is Stata's -ovtest- procedure, does not
generate a collinearity, and in step 4 -regress- drops nothing.
-coldiag2- shows the condition number for the regression in step 4 is
much smaller, but still way above the rule of thumb that ">30 means
collinearity problems": 538
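
For readers without -coldiag2-, here is a rough sketch of how a condition
number of this kind can be computed with built-in matrix commands.  It
assumes that "scaled variables" means every column, including the
constant, is scaled to unit length (the usual Belsley-Kuh-Welsch
convention); that assumption has not been checked against the -coldiag2-
source.  The matrix names XX, S, V and L are arbitrary.

```stata
* Rebuild the rescale-then-power regressors used in the example
use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear
quietly regress lw s
quietly predict double yhat
quietly summarize yhat, meanonly
quietly generate double yhatr = (yhat-r(min))/(r(max)-r(min))
forvalues k = 2/4 {
    quietly generate double yhatr`k' = yhatr^`k'
}

* Cross-product matrix X'X; -matrix accum- appends a constant column
matrix accum XX = s yhatr2 yhatr3 yhatr4
* Scale to unit-length columns: S[i,j] = XX[i,j]/sqrt(XX[i,i]*XX[j,j])
matrix S = corr(XX)
* Eigenvalues of S, largest first, returned in the row vector L
matrix symeigen V L = S
* Condition number = sqrt(largest eigenvalue / smallest eigenvalue)
display sqrt(L[1,1]/L[1,colsof(L)])
```

Working from X'X squares the condition number, so for the two badly
conditioned variable sets the smallest eigenvalue may be computed
inaccurately (or even come out negative, giving a missing result).  A
routine based on a singular value decomposition of X itself would be more
reliable there.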

My first question - why does the Stata method "work"?

My second question - *does* the Stata method work?  Or does rescaling
followed by raising to the 2nd, 3rd and 4th power introduce numerical
inaccuracies that cause what is a "genuine" near-collinearity to
decrease so much that Stata's -regress- doesn't detect it?

Any ideas?  It's not because I'm using floats.  Doubles everywhere.
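
One way to probe the second question, offered as a sketch rather than a
definitive check.  Since the rescaled yhat is an affine function of yhat,
say a + b*yhat, expanding (a + b*yhat)^k with the binomial theorem shows
that the powers of the rescaled yhat and the powers of yhat span the same
space once a constant and the original regressors (which span yhat) are
included.  In exact arithmetic, then, the F test should not depend on the
rescaling.  A better-conditioned basis for that same space is a set of
orthogonal polynomials in yhat.  The code below uses Stata's -orthpoly-;
the variable names op1-op4 are made up for illustration.

```stata
use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear
quietly regress lw s
quietly predict double yhat

* Orthogonal polynomials in yhat, degrees 1 to 4 (illustrative names)
orthpoly yhat, generate(op1 op2 op3 op4) degree(4)

* Keep the original regressor; add the 2nd- to 4th-degree terms
quietly regress lw s op2 op3 op4

* Joint test of the added terms, the analogue of step 5
testparm op2 op3 op4
```

If this matches -ovtest-, that would suggest the rescale-first approach
is recovering the full-rank test and that the dropped variable in the
other two versions is a numerical artifact.  If it matches the unrescaled
run instead, the opposite conclusion would follow.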

--Mark

***************** Example output ****************

. version 8.2

. version
version 8.2

. which ovtest
C:\Stata8\ado\base\o\ovtest.ado
*! version 2.3.6  05sep2001

. which coldiag2
c:\ado8\plus\c\coldiag2.ado
*! version 2.0, 01Dec2004, [email protected]

. 
. use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear
(Wages of Very Young Men, Zvi Griliches, J.Pol.Ec. 1976)

. 
. * Generate yhats
. qui regress lw s  

. qui predict double yhat

. * yhatr=rescaled yhat
. sum yhat, meanonly

. qui gen double yhatr = (yhat-r(min))/(r(max)-r(min))

. qui gen double yhat2=yhat^2

. qui gen double yhat3=yhat^3

. qui gen double yhat4=yhat^4

. * yhatr2=rescaled, then ^2; similarly for yhatr3 and yhatr4
. qui gen double yhatr2=yhatr^2

. qui gen double yhatr3=yhatr^3

. qui gen double yhatr4=yhatr^4

. * yhat2r=yhat^2 and then rescaled; similarly for yhat3r and yhat4r
. sum yhat2, meanonly

. qui gen double yhat2r = (yhat2-r(min))/(r(max)-r(min))

. sum yhat3, meanonly

. qui gen double yhat3r = (yhat3-r(min))/(r(max)-r(min))

. sum yhat4, meanonly

. qui gen double yhat4r = (yhat4-r(min))/(r(max)-r(min))

. * Summarize variables
. sum lw s yhat*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          lw |       758    5.686739    .4289494      4.605      7.051
           s |       758    13.40501    2.231828          9         18
        yhat |       758    5.686739    .2156493   5.261107   6.130727
       yhatr |       758    .4894459    .2479809          0          1
       yhat2 |       758    32.38544    2.473995   27.67924   37.58581
-------------+--------------------------------------------------------
       yhat3 |       758    184.7003     21.3139   145.6234   230.4284
       yhat4 |       758    1054.929    163.4247   766.1405   1412.693
      yhatr2 |       758    .3009707    .2769761          0          1
      yhatr3 |       758    .2142687    .2681644          0          1
      yhatr4 |       758    .1671979    .2530434          0          1
-------------+--------------------------------------------------------
      yhat2r |       758    .4750582    .2497327          0          1
      yhat3r |       758    .4607848    .2513286          0          1
      yhat4r |       758    .4466593     .252763          0          1

. 
. * Quadratic form of RESET
. * 1.  Unrescaled RESET
. * Collinearity appears
. qui regress lw s  yhat2 yhat3 yhat4

. testparm yhat2 yhat3 yhat4

 ( 1)  yhat2 = 0
 ( 2)  yhat3 = 0
 ( 3)  yhat4 = 0
       Constraint 1 dropped

       F(  2,   754) =    0.87
            Prob > F =    0.4191

. * 2.  yhat that is first ^2, ^3, ^4, then rescaled
. * Collinearity appears
. qui regress lw s  yhat2r yhat3r yhat4r

. testparm yhat2r yhat3r yhat4r

 ( 1)  yhat2r = 0
 ( 2)  yhat3r = 0
 ( 3)  yhat4r = 0
       Constraint 1 dropped

       F(  2,   754) =    0.87
            Prob > F =    0.4191

. * 3.  yhat that is first rescaled, then ^2, ^3, ^4
. * No collinearity
. qui regress lw s  yhatr2 yhatr3 yhatr4

. testparm yhatr2 yhatr3 yhatr4

 ( 1)  yhatr2 = 0
 ( 2)  yhatr3 = 0
 ( 3)  yhatr4 = 0

       F(  3,   753) =    0.59
            Prob > F =    0.6216

. * 4.  Stata's built-in ovtest
. *     Matches first-rescaled-then-powered, i.e., (3)
. qui regress lw s 

. ovtest

Ramsey RESET test using powers of the fitted values of lw
       Ho:  model has no omitted variables
                 F(3, 753) =      0.59
                  Prob > F =      0.6216

. 
. * Collinearities
. _rmcoll s  yhat2 yhat3 yhat4
note: yhat2 dropped due to collinearity

. _rmcoll s  yhat2r yhat3r yhat4r
note: yhat2r dropped due to collinearity

. _rmcoll s  yhatr2 yhatr3 yhatr4

. 
. * -coldiag2-
. coldiag2 s  yhat2 yhat3 yhat4

Condition number using scaled variables =   7454604.11

Condition Indexes and Variance-Decomposition Proportions

    condition
        index _cons     s yhat2 yhat3 yhat4

1        1.00  0.00  0.00  0.00  0.00  0.00 
2       16.73  0.00  0.00  0.00  0.00  0.00 
3      359.37  0.00  0.00  0.00  0.00  0.00 
4    33354.15  0.00  0.00  0.00  0.00  0.00 
5  7454604.11  1.00  1.00  1.00  1.00  1.00 


. coldiag2 s  yhat2r yhat3r yhat4r

Condition number using scaled variables =   1658268.08

Condition Indexes and Variance-Decomposition Proportions

    condition
        index  _cons      s yhat2r yhat3r yhat4r

1        1.00   0.00   0.00   0.00   0.00   0.00 
2        4.69   0.00   0.00   0.00   0.00   0.00 
3      168.37   0.00   0.00   0.00   0.00   0.00 
4    12145.95   0.00   0.00   0.00   0.00   0.00 
5  1658268.08   1.00   1.00   1.00   1.00   1.00 


. coldiag2 s  yhatr2 yhatr3 yhatr4

Condition number using scaled variables =       538.15

Condition Indexes and Variance-Decomposition Proportions

    condition
        index  _cons      s yhatr2 yhatr3 yhatr4

1        1.00   0.00   0.00   0.00   0.00   0.00
2        2.38   0.00   0.00   0.00   0.00   0.00
3       14.57   0.00   0.00   0.00   0.00   0.00
4       96.54   0.08   0.06   0.00   0.02   0.05
5      538.15   0.92   0.94   1.00   0.98   0.94

*********** do file to generate output **************

version 8.2
version
which ovtest
which coldiag2

use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear

* Generate yhats
qui regress lw s
qui predict double yhat
* yhatr=rescaled yhat
sum yhat, meanonly
qui gen double yhatr = (yhat-r(min))/(r(max)-r(min))
qui gen double yhat2=yhat^2
qui gen double yhat3=yhat^3
qui gen double yhat4=yhat^4
* yhatr2=rescaled, then ^2; similarly for yhatr3 and yhatr4
qui gen double yhatr2=yhatr^2
qui gen double yhatr3=yhatr^3
qui gen double yhatr4=yhatr^4
* yhat2r=yhat^2 and then rescaled; similarly for yhat3r and yhat4r
sum yhat2, meanonly
qui gen double yhat2r = (yhat2-r(min))/(r(max)-r(min))
sum yhat3, meanonly
qui gen double yhat3r = (yhat3-r(min))/(r(max)-r(min))
sum yhat4, meanonly
qui gen double yhat4r = (yhat4-r(min))/(r(max)-r(min))
* Summarize variables
sum lw s yhat*

* Quadratic form of RESET
* 1.  Unrescaled RESET
* Collinearity appears
qui regress lw s  yhat2 yhat3 yhat4
testparm yhat2 yhat3 yhat4
* 2.  yhat that is first ^2, ^3, ^4, then rescaled
* Collinearity appears
qui regress lw s  yhat2r yhat3r yhat4r
testparm yhat2r yhat3r yhat4r
* 3.  yhat that is first rescaled, then ^2, ^3, ^4
* No collinearity
qui regress lw s  yhatr2 yhatr3 yhatr4
testparm yhatr2 yhatr3 yhatr4
* 4.  Stata's built-in ovtest
*     Matches first-rescaled-then-powered, i.e., (3)
qui regress lw s
ovtest

* Collinearities
_rmcoll s  yhat2 yhat3 yhat4
_rmcoll s  yhat2r yhat3r yhat4r
_rmcoll s  yhatr2 yhatr3 yhatr4

* -coldiag2-
coldiag2 s  yhat2 yhat3 yhat4
coldiag2 s  yhat2r yhat3r yhat4r
coldiag2 s  yhatr2 yhatr3 yhatr4

