Re: st: Overriding dropping of collinear variables

From	[email protected] (William Gould, Stata)
To	[email protected]
Subject	Re: st: Overriding dropping of collinear variables
Date	Wed, 09 Jul 2003 09:03:44 -0500

James Valcour <[email protected]> asked,

> Is there a way to tell Stata (either v7 or v8) not to drop collinear
> variables?  Usually I wouldn't try and do this, but I'm trying to compare
> some output produced from SAS's proc genmod with some glm output from Stata.
> SAS doesn't automatically drop collinear variables.  I'm trying to do this
> because all the examples from a course I recently took were in SAS and I'm
> trying to see if I can get the same results from Stata.

Actually, I believe that SAS too will drop collinear variables and I suspect
this is a case where Stata is declaring a variable collinear and SAS is not.
Scott Merryman <[email protected]>, in responding to the question by James,
noted that SAS will produce the message 

    NOTE: The X'X matrix has been found to be singular and a generalized
    inverse was used to solve the normal equations. Estimates followed by the
    letter 'B' are biased, and are not unique estimators of the parameters.

and then worried that employing a generalized inverse as a solution might be
dangerous.  The fact is that the solution Stata implements can also be viewed
as (and is) a generalized-inverse solution.  There are many generalized
inverses, and dropping the collinear variables corresponds to one of them.
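
To make the generalized-inverse point concrete, here is a minimal sketch in
Python (not Stata).  The data, the no-intercept model, and every name in it
are made up for illustration.  It shows two different coefficient vectors
that both solve the singular normal equations: one sets the redundant
variable's coefficient to zero (the "drop it" solution), the other is the
minimum-norm solution, which is what a Moore-Penrose inverse would give.
Exact fractions are used so that round-off plays no part.

```python
from fractions import Fraction as F

# Hypothetical data: x2 is exactly 2 * x1, so X'X is singular.
x1 = [F(1), F(2), F(3)]
x2 = [2 * v for v in x1]
y = [F(3), F(5), F(10)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def fitted(b):
    """Fitted values for the no-intercept model y = b[0]*x1 + b[1]*x2."""
    return [b[0] * u + b[1] * v for u, v in zip(x1, x2)]

def solves_normal_equations(b):
    """Check X'X b == X'y, one row at a time."""
    row1 = dot(x1, x1) * b[0] + dot(x1, x2) * b[1] == dot(x1, y)
    row2 = dot(x2, x1) * b[0] + dot(x2, x2) * b[1] == dot(x2, y)
    return row1 and row2

# Only the combination b[0] + 2*b[1] is identified; call its value c.
c = dot(x1, y) / dot(x1, x1)

b_drop = (c, F(0))                 # drop x2: its coefficient is fixed at 0
b_min_norm = (c / 5, 2 * c / 5)    # shortest vector with b[0] + 2*b[1] == c

print("drop x2:     ", b_drop, solves_normal_equations(b_drop))
print("minimum norm:", b_min_norm, solves_normal_equations(b_min_norm))
print("same fitted values:", fitted(b_drop) == fitted(b_min_norm))
```

The coefficients differ but the fitted values do not, which is the sense in
which neither solution is more "dangerous" than the other.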


When are variables collinear?
-----------------------------

In textbooks, authors write about "perfectly collinear" variables, by which
they mean the correlation between the two variables is exactly 1 (or -1).  For
instance, the following two variables are perfectly collinear:

            x1          x2
             1           2 
             2           4
             3           6

In the real world of statistical computing things are seldom so clear cut.
Computers work in binary and you think in decimal, meaning that when you input
6.1, the computer does not really store 6.1 exactly.  From those numbers,
users generate calculated values, such as x2^2.  On top of that, the
finite-precision calculations subsequently may lead to round-off error.  By
the time the computer is studying the problem, calculation does not lead to
clear-cut 0s and 1s, but to numbers like 1e-12 and .9999999999997.  It is from
numbers like those that the decision has to be made.
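
The binary-versus-decimal point is easy to check for yourself.  The sketch
below uses Python rather than Stata, on the assumption that both store
doubles in the usual IEEE 754 format; it prints the value actually stored
for 6.1 and shows a sum that picks up round-off.

```python
from decimal import Decimal

# Decimal(float) shows the exact value of the double nearest to 6.1.
stored = Decimal(6.1)
print("stored for 6.1:", stored)
print("exactly 6.1?   ", stored == Decimal("6.1"))

# Round-off accumulates in later arithmetic: ten 0.1s need not sum to 1.
total = sum([0.1] * 10)
print("sum of ten 0.1s:", repr(total))
print("equal to 1.0?   ", total == 1.0)
```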

How one makes that decision depends not only on the amount of numerical 
round-off error -- something one can analyze and have good knowledge about -- 
but also on the original accuracy to which the data were measured.
Consider the following data:

            x1          x2
             1           2 
             2           4
             3           6.000000000001

Tell me that the measurement was made by a physicist in a certain context, and
I might actually believe he or she measured 6.000000000001.  The two sequences
are not collinear and something very small is going on.  Tell me they were
made from economic data, and I will immediately suspect that the .000000000001
part is roundoff error from some earlier calculation.

My point is that there is no right answer and so, in a few cases, it will
not surprise me if Stata and SAS disagree.
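
Here is a minimal sketch, in Python rather than Stata, of why the second
data set above is a hard case.  The function leftover() is a made-up name,
and the measure it computes (what remains of x2 after projecting it on x1,
with no intercept) is only an illustrative stand-in, not the test Stata or
SAS applies.  Computed exactly, the leftover is tiny but positive, so the
columns are not collinear; computed in double precision, it is
indistinguishable from round-off.

```python
from fractions import Fraction

x1 = [1.0, 2.0, 3.0]
x2 = [2.0, 4.0, 6.000000000001]

def leftover(u, v):
    """Sum of squares of v not explained by u: v'v - (u'v)^2 / u'u."""
    uu = sum(a * a for a in u)
    uv = sum(a * b for a, b in zip(u, v))
    vv = sum(b * b for b in v)
    return vv - uv * uv / uu

# Ordinary double-precision arithmetic.
in_doubles = leftover(x1, x2)

# Exact rational arithmetic on the very same stored values.
exact = leftover([Fraction(a) for a in x1], [Fraction(b) for b in x2])

print("double precision:", in_doubles)
print("exact:           ", float(exact))
```

The exact answer is far below anything double precision can resolve next to
numbers of size 56 (the sum of squares of x2), so a program working in
doubles has to decide from noise.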


Changing when Stata declares variables collinear
------------------------------------------------

There is an undocumented way you can control when Stata will determine 
collinearity.  In Stata-undocumented-speak it is called "tol 1" or, in the
cases of -anova- and -manova-, "tol 2".  "tol 1" affects how Stata inverts
matrices in all cases except -anova- and -manova-.  "tol 1" is irrelevant
in the cases of -anova- and -manova-, and "tol 2" is the relevant parameter.

The default values of these two parameters are 

        tol 1  =  1.0e-9
        tol 2  =  1.0e-8

You can reset them.  If you make them smaller, Stata will be less likely to 
declare collinearity.  Set them larger, and Stata will be more likely.
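
The effect of the tolerance can be sketched in Python (not Stata).  Both
functions below are hypothetical, and the measure is a stand-in chosen for
simplicity; the post does not say what quantity Stata actually compares
with tol 1 or tol 2.  The point is only the direction of the effect: the
same data are declared collinear under a larger tolerance and not under a
smaller one.

```python
def collinearity_measure(u, v):
    """Share of v'v left after projecting v on u (no intercept)."""
    uu = sum(a * a for a in u)
    uv = sum(a * b for a, b in zip(u, v))
    vv = sum(b * b for b in v)
    return (vv - uv * uv / uu) / vv

def declared_collinear(u, v, tol):
    """Hypothetical rule: collinear if the leftover share is below tol."""
    return abs(collinearity_measure(u, v)) < tol

x1 = [1.0, 2.0, 3.0]
x2 = [2.0, 4.0, 6.00001]   # nearly, but not exactly, 2 * x1

for tol in (1e-9, 1e-15):
    print("tol =", tol, "-> collinear?", declared_collinear(x1, x2, tol))
```

For these data the measure is roughly 6e-13, so the rule says "collinear"
at 1e-9 and "not collinear" at 1e-15.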

I am about to tell you how to set them but I warn you, reset them and we wash 
our hands of you.  Set the number too small, and you might cause Stata to 
crash.  Set the number larger than that, but still too small, and in truly
collinear cases, you can end up with estimates based on nothing but numerical 
round-off error.  Set the number too large, and Stata will drop variables 
left and right.

I cannot tell you, however, that we have set the numbers right.  What I can
tell you is that we have carefully considered the problem and that Stata now
has a long history of using the numbers as we have set them, a history
incorporating literally millions of matrix inversions, and users have 
seldom complained.

That said, here's how you can reset tol 1 to 1.0e-6:

        . set debug on 
        . set tol 1 1e-6
        . set debug off

You can reset tol 2 similarly.  

The -set tol- and -set debug- commands are undocumented, "secret" commands.
-set tol- will not work unless Stata is in -debug- mode; this way, no one 
can accidentally change these critical values.  I recommend against running
Stata in -debug- mode because some commands produce a lot of output that you
do not want to see.

To reset tol 1 back to its officially endorsed value, you could type

         . set debug on 
         . set tol 1 1e-9
         . set debug off

but I recommend simply exiting and relaunching Stata.  Nothing you do in
resetting tol 1 or tol 2 changes Stata permanently; the defaults return when
Stata is restarted.

If you are trying to change the behavior of -anova- or -manova-, change 
"tol 2" rather than "tol 1".

-- Bill
[email protected]