Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
wgould@stata.com (William Gould, StataCorp LP) |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Re: Comparison of results from Stata |

Date |
Mon, 07 Feb 2011 10:27:57 -0600 |

In a previous exchange, Matthew discovered differences between -regress- and using Mata invsym(X'X)*X'y, and I explained those differences in terms of finite precision and, in particular, traced certain discrepancies down to Matthew's inadvertant use of -float- rather than -double-. Along the way, we discovered that -regress- was more accurate than invsym(X'X)*X'y. Matthew Baker <matthew.baker@hunter.cuny.edu> writes, > this begs the question: what is "reg" doing > differently? While running your code makes the errors much > closer, they still aren't identical - why is that? Matthew is imagining there is a single correct answer. Well, there is a single correct answer mathematically. Computers, however, perform calculations with a limited number of digits, aka finite precision, and the results they produce are close to, but not identical to, the single mathematical answer. As a result, two different formulas for calculating a result, formulas that would produce mathematically identical results, can and do produce different results when performed on a finite-precision computer. An important part of the scientific programming StataCorp does is evaluating those differences and finding fomrulas which, when performed using finite precision arithmetic, produces results closer to the true mathematical result. In Mata, Matthew obtained regression coefficients using the classic mathematical formula (X'X)^(-1)X'y, which he coded in Mata as invsym(X'X)*X'y. Stata's -regress- uses different formulas to obtain b. Performed with infinite precision, the formulas -regress- uses would produce the same, mathematically true result as (X'X)^(-1)X'y. Performed in finite precision, the fomulas -regress- uses produce results closer to the matematical answer. So what does -regress- do? First, rather than calculating (an approximiation to) (X'X)^(-1), which can then be used in the formula (X'X)^(-1)X'y directly to obtain b, Stata uses a solver, and solves the equation (X'X)b = X'y for b. Premultiply both sides of the above equation by (X'X)^(-1) and you will obtain b = (X'X)^(-1)*X'y, but that is not how a solver works. The particular solver Mata uses is based on LR decomposition. Mata provides this capability -- see -help mata lusolve()- -- so Matthew could have used it, but -lusolve()- is more trouble than using syminv(X'X)*X'y, and really, there's no reason he should have used it. syminv(X'X)*X'y produces an excellent answer in this case. EVen so, lusolve() would have produced a better answer. If Matthew wanted to implement using lusolve() in all cases, I should warn him that LU decomposition assumes that matrix is full rank. Matthew will need to think about how he will detect collinearity and reduce the problem before giving it over to the LU solver. Before Matthew does that, however, I need to mention that use of a solver rather than an inverter is just the icing on the cake. Before he even gets to solving (X'X)b=X'y for b, Matthew needs to think carefully about the calculation of X'X. Finite-precision calculation of X'X is a disaster in the making. The problem is that computers don't add and subtract numbers on different scales as accurately as you might expect. Just forming the sum of x1+x2+x3+...+xN to obtain a mean requires caution. The obvious way to proceed is to first add x+x2, then add x3 to that, and so on. If all the x's are greater than 0, the sequence (x1, x1+x2, x1+x2+x3, ...) grows, and the scale of x1+x2 is roughly twice than of x3, x1+x3+x3 is roughly three times that of x4, and so on. The result is that each subsequent addition is performed less and less accurately on a finite-precision machine. I could write a lot about that problem, but let me skip all that and just say that one thing that well help dramatically is to remove the mean from each of x1, x2, x3, ..., and so calculate ((x1-mean), (x1-mean)+(x2-mean), ...). It will not eliminate the problem, but it will be a big help. Before Matthew can adopt that approach, however, he will need to think about how he is going to obtain an accurate apporiximation of the mean, which is itself based on x1+x2+...+xN. I also note, Matthew should think about how much that will matter, which is to say, how error in the calculation of the mean will filter through the rest of the calculation. Again, I'll skip a lengthy discussion and merely say that there are ways of calculating a sum that are more accurate than summing the elements. Better is to calculate a running approximation of the mean, and update that. To wit, first I approximate the mean as m1 = x1, then I approximate it as m2 = (x1+x2)/2, then I approximate it as m3 = (m2*(2/3)+x3/3), and so on. Other things Matthew might want to think about to obain more precise result is putting x is ascending order and/or making the calculation in quad rather than double precision. Mata will do all of that, too; see cross(), crossdev(), and quadcross(), and quadcrossdev(). Now, if Matthew removes the mean, he will be able to obtain b1 and b2 for the linear regression, y = b0 + b1*x1 + b2*x2 but he will need to develop an accurate calculation formula for b0. There's an obvious one staring you in the fact, assuming you have accurate renditions of the means of y, x1, and x2. Obtaining the standard errors for the coefficients is a entirely new problem. And that's what we do at StataCorp. The few extra digits we of accuracy we get are not the point of the exercise. We go through all of these macinations so that, when you give us a problem that is poorly scaled, or with lots of collinearity, problems where syminv(X'X)*X'y would produce inaccurate results, we produce results that are still close to the mathematically true answer. The few extra digits of accuracy that you observe in most problems come for free. -- Bill wgould@stata.com * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Re: Comparison of results from Stata***From:*Matthew J Baker <matthew.baker@hunter.cuny.edu>

- Prev by Date:
**Re: st: bootstrap saving option** - Next by Date:
**st: Mata performance monitoring** - Previous by thread:
**re: Re: st: Re: Comparison of results from Stata** - Next by thread:
**Re: st: Re: Comparison of results from Stata** - Index(es):