[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
wgould@stata.com (William Gould, Stata) |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: data set larger than RAM |

Date |
Thu, 03 Nov 2005 10:29:06 -0600 |

Thomas Cornelissen <cornelissen@mbox.iqw.uni-hannover.de> asks a question that begins, > Suppose you had K=5000 regressors and N=10 million observations. I have a lot to say about that opening statement. The summary will be that this is not just a memory problem, but that very large datasets not just quantitatively different from smaller datasets, they are qualitatively different, too. These break down into Numerical Issues and Substantive Issues. Numerical issues ---------------- I am usually pretty relaxed about numerical issues. StataCorp is not, but I, personally, am. Many of you learned that regression coefficients can be calculated via b = (X'X)^(-1)X'y. They can be, but those who worry about numerical accuracy, do not use that formula. The first inaccuracy has to do with simply forming X'X and X'y. It can be formed more accurately if you first remove the means, and so calculate (X-xbar)'(X-Xbar) and (X-Xbar)'(y-ybar). In many problems, this makes a world of difference. One can go further, too, and normalize each of the variables to have mean 0 *AND* to have variance 1. That will help numerically, but there are arguments both ways considering whether that is desirable. I don't want to get into that right now, except to say that I, personally, am not a fan of the extra normalization given how most people use statistical packages. Stata does not do the variance=1 part. If it did, in many of cases where Stata drops a variable as collinear and you are happy about that, Stata would not. On the other hand, there are a few cases where Stata drops a variable and someone wishes it had been left in. I have never seen a real case of that, however, meaning a real researcher working with real data. Numerical analyst friends have concocted cases for me. The second inaccuracy as to do with forming the inverse of (X-Xbar)'(X-Xbar). Stata forms (X-Xbar)'(X-Xbar) and (X-Xbar)'(y-ybar), but then it does *NOT* use the formula ((X-xbar)'(X-bar))^(-1)(X-xbar)'(y-ybar) to obtain b. Rather, it uses a "solve" approach: Solve ((X-xbar)'X-Xbar))b = (X-Xbar)'(y-ybar) that results in many less additions being carried out and results in a more accurate answer. The third inaccuracy has to do with the calculations for the intercept, but never mind. Stata does that well, too. All that said, in most cases, b = (X'X)^(-1)X'y usually produces an accurate answer. It produces an accurate answer if (1) the data are not too correlated, (2) the means of X and y are all roughly equal, and (3) there are not too many observations. Okay, so Stata looks to be very careful about how it makes the linear-regression calculation, so once we are around the memory problem, we are set to run regressions on 10 million, or a billion observations, right? No. As the number of obsrvations grows, just calculating a sum becomes problematic. Stata calculates sums by addition in the usual way. If Stata needs to sum x1, x2, x3, ..., Stata adds them. For n very, very large, that does not work well. This is what I meant when I said very large datasets are qualitatively, not just quantitatively, different. A new problem arises. First, let's understand the problem. To add numbers, computers must "normalize" them. If I want to add 1e+5 and 1e-5, I must first write them so that the digits line up: 10000 .0001 ---------- 10000.0001 That's called normalization; it's just a fancy word for something we all do without even thinking about it. But computers have finite accuracy. Pretend our computer had 7 digits of accuracy. That means it can record only 7 digits, and we are left with: 12345 67| | 10000.00| .00|01 ----------- 10000.00| On this computer, 1e+5+1e-5 = 1e+5. The 1e-5 vanishes! Does this problem matter? A little loss of 1e-5 in a number as big as 1e+5 is not much, I admit. Now let's consider adding the following numbers 1e+5 + 1e-5 + 1e-5 + 1e-5 + ... + 1e-5 ------------------------------- assume I have 1 million of these numbers The true sum is 1e+5 + (1e+6)*(1e-5) is 10,010. What will the computer calculation yield? Let's proceed from left to right: First, we add 1e+5 to 1e-5 to obtain 1e+5, then we add 1e-5 to obtain 1e+5, ..., and we obtain 1e+5, or 10,000. We have so much error that adding the terms did absolutely nothing! If we proceeded from right to left, we would get 10,010. The order in which we add is important. In a very large dataset, the accurate way to sum x1, x2, ..., is to order them from smallest to largest, and then add those, left to right. There are other, less accurate but less computationally intensive ways to address this problem, too. One is the mean update rule: If we knew the mean accurately, we could multiply by N to obtain the sum. Let M(i) be the mean of the first i elements. Then, M(i+1) = M(i)*((i)/(i+1) + x_i/(i+1) Now let's return to real life. Stata does not use the mean-update rule in linear regression, and I know of no package that does. It doesn't matter for the size of datasets we usually deal with. In terms of the formulas above, we use computers that provide 8-byte doubles, and that amounts to approximately 16 decimal digits of accuracy. Increasing the number of digits, as I have shown, aleviates the problem. Stata, when it calculates sums, uses double precision and sometimes, when Stata is especially worried about the problem (as it is with -stcox-), it uses quad precision. Quad precision provides a bit of 32 decimal digits. (Stata is unique or rather unique on its use of quad precision. I say rather unique because I assume some other package must, but I not know of one.) There are other numeric problems with very large datasets as well, which I will not go into here. They may turn out to be important, but they are usually minor compared to the summation problem. Substantive issues ------------------ Let us assume that we somehow get a b which is an accurate representation of (X'X)^(-1)X'y performed in infinite precision, and that we similarly have an accurate representation of s^2(X'X)^(-1), the variance matrix, the squareroot of the diagonal elements of which are standard errors. What do they mean? I ask you to think about statistics. The standard errors reflect the sampling error. At 10-million observations, the sampling error is not much. Standard errors are theoretical lower bounds about the amount of uncertainty, theoretical because they assume the *ONLY* source of error is sampling error, and in particular, they ignore specification error. The uncertainty about the specification probably now exceeds the sampling error. At 10 million observations, we may very well have the population. In that case, populationwise, the uncertainty is zero. Is the researcher reporting standard errors because he or she is interested in alternative universes? I doubt it: the researcher is still uncertain because he or she does not trust that he or she has estimated the correct model. The standard errors that statistics provides us do not address that issue. Advice ------ I find these fascinating problems, numerically and substantively. I also believe that these are problems of growing importance. My advice to anyone interested in breaking into the statistical software industry is to carve out these problems and start working on them, substantively and numerically. Produce software for supercomputers. Thomas probably does not want to do this, however. So Thomas is in a difficult position. I offer the following advice: 1. Whatever software you use -- Stata, SAS, ... -- do *NOT* assume you are getting correct answers on anything. You must verify as best you can. 2. Draw a reasonable-sized (but large) subsample from the data. Perform all analyses on that, and compare results to those obtained from the full dataset. 3. Make sure all variables have reasonable and roughly equal means *AND* standard deviations. These is especially important if you are not using Stata. I am *NOT* saying that these problems I have outlined are going to affect Thomas with his 10-million observation, 5,000-variable dataset. I do not know; it all depends, and it depends on too many things to outline here. One could write a book. All I can say is that Thomas is standing near the precipice. In these kinds of problems, we throw around powers of 2 and 10. For Thomas' data, the problem may not strike until 100-million observations, or a billion, but Thomas needs to be cautious. That is why I suggest (2), and I suggest (2) even more strongly if Thomas is going to make any of his own calculations. Stata provides the tools Thomas needs for the full calculation. I said to normalize variables to have mean 0, variance 1. Thomas can trust the means and stanard deviations reported by -summarize-. -summarize- is one of Stata's routines that uses quad precision. Thomas should learn about Mata. For calculating X'X and X'y, Mata provides cross() and quadcross(), the latter being quad precision. I believe that quad preicison routines are the solution to most of Thomas' problems. Thomas should learn about Mata's solvers so that he can efficiently solve for b rather than basing results on inverses. On this, too, I can offer reassurance: there are no better routines available, period, than the LAPACK-based routines that Mata provides. Good luck. -- Bill wgould@stata.com * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: checking if a dataset exists** - Next by Date:
**Re: st: -Table- output presented horizontally and exported** - Previous by thread:
**Re: st: data set larger than RAM** - Next by thread:
**Re: st: data set larger than RAM** - Index(es):

© Copyright 1996–2015 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |