
Re: st: Re: Comparison of results from Stata


From   wgould@stata.com (William Gould, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Re: Comparison of results from Stata
Date   Mon, 07 Feb 2011 10:27:57 -0600

In a previous exchange, Matthew discovered differences between 
-regress- and using Mata invsym(X'X)*X'y, and I explained those 
differences in terms of finite precision and, in particular, traced 
certain discrepancies down to Matthew's inadvertent use of -float- 
rather than -double-.  Along the way, we discovered that -regress- 
was more accurate than invsym(X'X)*X'y.

Matthew Baker <matthew.baker@hunter.cuny.edu> writes,

> this begs the question: what is "reg" doing 
> differently? While running your code makes the errors much 
> closer, they still aren't identical - why is that?

Matthew is imagining there is a single correct answer.  Well, there is 
a single correct answer mathematically.  Computers, however, perform 
calculations with a limited number of digits, aka finite precision, and 
the results they produce are close to, but not identical to, the single
mathematical answer.   As a result, two different formulas for calculating 
a result, formulas that would produce mathematically identical results, 
can and do produce different results when performed on a finite-precision 
computer.  An important part of the scientific programming StataCorp does 
is evaluating those differences and finding formulas which, when performed
using finite-precision arithmetic, produce results closer to the true 
mathematical result.
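A one-line illustration of that point, in Python rather than Mata (the
specific values are mine, not from the thread): the same three-term sum,
grouped two mathematically equivalent ways, rounds to two different
double-precision answers.

```python
# Addition is not associative in double precision: the same sum,
# grouped two different ways, rounds to two different results.
left = (0.1 + 0.2) + 0.3   # 0.1 + 0.2 rounds up to 0.30000000000000004
right = 0.1 + (0.2 + 0.3)  # 0.2 + 0.3 rounds to exactly 0.5
print(left == right)       # False: the groupings differ in the last bit
```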

In Mata, Matthew obtained regression coefficients using the classic 
mathematical formula (X'X)^(-1)X'y, which he coded in Mata as 
invsym(X'X)*X'y.  

Stata's -regress- uses different formulas to obtain b.  Performed with 
infinite precision, the formulas -regress- uses would produce the same,
mathematically true result as (X'X)^(-1)X'y.  Performed in finite precision,
the formulas -regress- uses produce results closer to the mathematical answer.

So what does -regress- do?

First, rather than calculating (an approximation to) (X'X)^(-1), which 
can then be used in the formula (X'X)^(-1)X'y directly to obtain b, Stata uses
a solver, and solves the equation 

	(X'X)b = X'y

for b. 

Premultiply both sides of the above equation by (X'X)^(-1) and you will 
obtain b = (X'X)^(-1)*X'y, but that is not how a solver works.  The 
particular solver Mata uses is based on LU decomposition.  
Mata provides this capability -- see -help mata lusolve()- -- so Matthew 
could have used it, but -lusolve()- is more trouble than using
invsym(X'X)*X'y, and really, there's no reason he should have used it.
invsym(X'X)*X'y produces an excellent answer in this case.  Even so, 
lusolve() would have produced a better answer.
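To make the solver idea concrete, here is a toy sketch in Python (my own
illustration, not Mata's LU code): it solves a 2x2 symmetric system by one
step of Gaussian elimination plus back-substitution, never forming the
inverse at all.

```python
def solve2(a, b):
    """Solve the symmetric 2x2 system a x = b by one elimination
    step plus back-substitution -- no inverse is ever formed."""
    (a11, a12), (_, a22) = a
    m = a12 / a11                  # elimination multiplier
    a22p = a22 - m * a12           # (2,2) entry after elimination
    b2p = b[1] - m * b[0]          # right-hand side after elimination
    x2 = b2p / a22p
    x1 = (b[0] - a12 * x2) / a11   # back-substitute
    return [x1, x2]

# For example, solving (4 2; 2 3) x = (10; 9):
print(solve2([[4.0, 2.0], [2.0, 3.0]], [10.0, 9.0]))  # [1.5, 2.0]
```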

If Matthew wanted to use lusolve() in all cases, I should warn 
him that LU decomposition assumes that the matrix is full rank.  Matthew will 
need to think about how he will detect collinearity and reduce the problem 
before giving it over to the LU solver. 

Before Matthew does that, however, I need to mention that use of a
solver rather than an inverter is just the icing on the cake.  Before
he even gets to solving (X'X)b=X'y for b, Matthew needs to think
carefully about the calculation of X'X.

Finite-precision calculation of X'X is a disaster in the making.  The 
problem is that computers don't add and subtract numbers on different 
scales as accurately as you might expect.  Just forming the sum of 
x1+x2+x3+...+xN to obtain a mean requires caution.  The obvious way 
to proceed is to first add x1+x2, then add x3 to that, and so on.  
If all the x's are greater than 0, the sequence (x1, x1+x2, x1+x2+x3, ...)
grows, and the scale of x1+x2 is roughly twice that of x3, 
x1+x2+x3 is roughly three times that of x4, and so on.  The result is 
that each subsequent addition is performed less and less accurately on 
a finite-precision machine.
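A short Python illustration of the scale problem (the numbers are mine):
near 1e16, adjacent doubles are 2.0 apart, so adding 1.0 there changes
nothing at all.

```python
import math

xs = [1e16] + [1.0] * 10 + [-1e16]   # ten 1.0's hiding behind a huge term

total = 0.0
for x in xs:
    total += x        # each 1.0 is rounded away: 1e16 + 1.0 == 1e16

print(total)          # 0.0 -- the ten 1.0's vanished entirely
print(math.fsum(xs))  # 10.0 -- compensated summation recovers them
```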

I could write a lot about that problem, but let me skip all that and
just say that one thing that will help dramatically is to remove the
mean from each of x1, x2, x3, ..., and so calculate ((x1-mean),
(x1-mean)+(x2-mean), ...).  It will not eliminate the problem, but it
will be a big help.
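Here is what that buys, sketched in Python with made-up numbers: a sum of
squared deviations computed with and without removing the mean first.
(The mean here happens to be exactly representable, which sidesteps the
problem raised in the next paragraph.)

```python
x = [1e8 + 1.0, 1e8 + 2.0, 1e8 + 3.0]
mean = 1e8 + 2.0   # exactly representable for this toy data

# Uncentered: every square sits near 1e16, where doubles are spaced
# 2 apart, and the final subtraction cancels nearly every digit.
uncentered = sum(xi * xi for xi in x) - len(x) * mean * mean

# Centered: remove the mean first, so every term is O(1).
centered = sum((xi - mean) ** 2 for xi in x)

print(uncentered)  # 0.0 -- all significant digits lost
print(centered)    # 2.0 -- the exact answer
```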

Before Matthew can adopt that approach, however, he will need to think
about how he is going to obtain an accurate approximation of the
mean, which is itself based on x1+x2+...+xN.  I also note that Matthew
should think about how much that will matter, which is to say, how
error in the calculation of the mean will filter through the rest of
the calculation.  Again, I'll skip a lengthy discussion and merely say
that there are ways of calculating a sum that are more accurate than
summing the elements.  Better is to calculate a running approximation
of the mean, and update that.  To wit, first I approximate the mean as
m1 = x1, then I approximate it as m2 = (x1+x2)/2, then I approximate
it as m3 = (m2*(2/3)+x3/3), and so on.  Other things Matthew might
want to think about to obtain a more precise result are putting the x's
in ascending order and/or making the calculation in quad rather than
double precision.
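The running-mean update just described looks like this in Python (a
sketch of the idea, not Mata's implementation):

```python
def running_mean(xs):
    """Approximate the mean incrementally:
    m1 = x1, then m_k = m_{k-1}*(k-1)/k + x_k/k.
    Each update stays on the scale of the data, so no intermediate
    quantity ever grows to N times the data's scale."""
    m = 0.0
    for k, x in enumerate(xs, start=1):
        m = m * (k - 1) / k + x / k
    return m

print(running_mean([2.0, 4.0, 6.0]))  # 4.0
```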

Mata will do all of that, too; see cross(), crossdev(), quadcross(), 
and quadcrossdev().

Now, if Matthew removes the mean, he will be able to obtain b1 and b2
for the linear regression, 

        y = b0 + b1*x1 + b2*x2

but he will need to develop an accurate calculation formula for b0.
There's an obvious one staring you in the face, assuming you have
accurate renditions of the means of y, x1, and x2.  Obtaining the
standard errors for the coefficients is an entirely new problem.
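The obvious formula is the identity ybar = b0 + b1*x1bar + b2*x2bar,
rearranged for b0.  Sketched in Python with invented numbers (the slopes
and means are assumed already computed accurately):

```python
def intercept(ybar, slopes, xbars):
    """Recover b0 from the fitted slopes and the variable means,
    using ybar = b0 + b1*x1bar + b2*x2bar rearranged for b0."""
    return ybar - sum(b * xb for b, xb in zip(slopes, xbars))

# If y = 1 + 2*x1 + 3*x2 exactly, with x1bar = 2 and x2bar = 3,
# then ybar = 14 and the intercept comes back exactly:
print(intercept(14.0, [2.0, 3.0], [2.0, 3.0]))  # 1.0
```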

And that's what we do at StataCorp.

The few extra digits of accuracy we get are not the point of the 
exercise.  We go through all of these machinations so that, when you 
give us a problem that is poorly scaled, or with lots of collinearity, 
problems where invsym(X'X)*X'y would produce inaccurate results, 
we produce results that are still close to the mathematically true 
answer.  The few extra digits of accuracy that you observe in most 
problems come for free.

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

