[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Re: Comparison of results from Stata
firstname.lastname@example.org (William Gould, StataCorp LP)
Re: st: Re: Comparison of results from Stata
Mon, 07 Feb 2011 10:27:57 -0600
In a previous exchange, Matthew discovered differences between
-regress- and using Mata invsym(X'X)*X'y, and I explained those
differences in terms of finite precision and, in particular, traced
certain discrepancies down to Matthew's inadvertant use of -float-
rather than -double-. Along the way, we discovered that -regress-
was more accurate than invsym(X'X)*X'y.
Matthew Baker <email@example.com> writes,
> this begs the question: what is "reg" doing
> differently? While running your code makes the errors much
> closer, they still aren't identical - why is that?
Matthew is imagining there is a single correct answer. Well, there is
a single correct answer mathematically. Computers, however, perform
calculations with a limited number of digits, aka finite precision, and
the results they produce are close to, but not identical to, the single
mathematical answer. As a result, two different formulas for calculating
a result, formulas that would produce mathematically identical results,
can and do produce different results when performed on a finite-precision
computer. An important part of the scientific programming StataCorp does
is evaluating those differences and finding fomrulas which, when performed
using finite precision arithmetic, produces results closer to the true
In Mata, Matthew obtained regression coefficients using the classic
mathematical formula (X'X)^(-1)X'y, which he coded in Mata as
Stata's -regress- uses different formulas to obtain b. Performed with
infinite precision, the formulas -regress- uses would produce the same,
mathematically true result as (X'X)^(-1)X'y. Performed in finite precision,
the fomulas -regress- uses produce results closer to the matematical answer.
So what does -regress- do?
First, rather than calculating (an approximiation to) (X'X)^(-1), which
can then be used in the formula (X'X)^(-1)X'y directly to obtain b, Stata uses
a solver, and solves the equation
(X'X)b = X'y
Premultiply both sides of the above equation by (X'X)^(-1) and you will
obtain b = (X'X)^(-1)*X'y, but that is not how a solver works. The
particular solver Mata uses is based on LR decomposition.
Mata provides this capability -- see -help mata lusolve()- -- so Matthew
could have used it, but -lusolve()- is more trouble than using
syminv(X'X)*X'y, and really, there's no reason he should have used it.
syminv(X'X)*X'y produces an excellent answer in this case. EVen so,
lusolve() would have produced a better answer.
If Matthew wanted to implement using lusolve() in all cases, I should warn
him that LU decomposition assumes that matrix is full rank. Matthew will
need to think about how he will detect collinearity and reduce the problem
before giving it over to the LU solver.
Before Matthew does that, however, I need to mention that use of a
solver rather than an inverter is just the icing on the cake. Before
he even gets to solving (X'X)b=X'y for b, Matthew needs to think
carefully about the calculation of X'X.
Finite-precision calculation of X'X is a disaster in the making. The
problem is that computers don't add and subtract numbers on different
scales as accurately as you might expect. Just forming the sum of
x1+x2+x3+...+xN to obtain a mean requires caution. The obvious way
to proceed is to first add x+x2, then add x3 to that, and so on.
If all the x's are greater than 0, the sequence (x1, x1+x2, x1+x2+x3, ...)
grows, and the scale of x1+x2 is roughly twice than of x3,
x1+x3+x3 is roughly three times that of x4, and so on. The result is
that each subsequent addition is performed less and less accurately on
a finite-precision machine.
I could write a lot about that problem, but let me skip all that and
just say that one thing that well help dramatically is to remove the
mean from each of x1, x2, x3, ..., and so calculate ((x1-mean),
(x1-mean)+(x2-mean), ...). It will not eliminate the problem, but it
will be a big help.
Before Matthew can adopt that approach, however, he will need to think
about how he is going to obtain an accurate apporiximation of the
mean, which is itself based on x1+x2+...+xN. I also note, Matthew
should think about how much that will matter, which is to say, how
error in the calculation of the mean will filter through the rest of
the calculation. Again, I'll skip a lengthy discussion and merely say
that there are ways of calculating a sum that are more accurate than
summing the elements. Better is to calculate a running approximation
of the mean, and update that. To wit, first I approximate the mean as
m1 = x1, then I approximate it as m2 = (x1+x2)/2, then I approximate
it as m3 = (m2*(2/3)+x3/3), and so on. Other things Matthew might
want to think about to obain more precise result is putting x is
ascending order and/or making the calculation in quad rather than
Mata will do all of that, too; see cross(), crossdev(), and quadcross(),
Now, if Matthew removes the mean, he will be able to obtain b1 and b2
for the linear regression,
y = b0 + b1*x1 + b2*x2
but he will need to develop an accurate calculation formula for b0.
There's an obvious one staring you in the fact, assuming you have
accurate renditions of the means of y, x1, and x2. Obtaining the
standard errors for the coefficients is a entirely new problem.
And that's what we do at StataCorp.
The few extra digits we of accuracy we get are not the point of the
exercise. We go through all of these macinations so that, when you
give us a problem that is poorly scaled, or with lots of collinearity,
problems where syminv(X'X)*X'y would produce inaccurate results,
we produce results that are still close to the mathematically true
answer. The few extra digits of accuracy that you observe in most
problems come for free.
* For searches and help try: