# st: data set larger than RAM

From: Thomas Cornelißen

Subject: st: data set larger than RAM

Date: Wed, 2 Nov 2005 16:54:17 +0100

Dear Stata Users,
Suppose you had K=5000 regressors and N=10 million observations. This might come from a linked employer-employee data set where you explicitly include all firm dummies (and algebraically sweep out the person effects through the within transformation) in order to compute person and firm effects.

As I understand it, Stata SE is capable of using up to 11 000 variables. But in this case the data matrix would be 50 GB, even assuming that each regressor can be stored as a 1-byte variable.
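The 50 GB figure can be checked with a few lines of arithmetic. The Python sketch below is only an illustration; the function name `matrix_size_gb` is made up for this example, and it assumes decimal gigabytes (10^9 bytes) and a dense N x K matrix. The second call assumes 4 bytes per cell, which is what a Stata `float` uses and which may be the more relevant case once the dummies are time-demeaned and take fractional values.

```python
def matrix_size_gb(n_obs, n_regressors, bytes_per_cell):
    """Size of a dense N x K data matrix in decimal gigabytes (10**9 bytes)."""
    return n_obs * n_regressors * bytes_per_cell / 10**9


N = 10_000_000   # observations
K = 5_000        # regressors

raw_gb = matrix_size_gb(N, K, 1)     # raw 0/1 dummies stored as 1-byte variables
float_gb = matrix_size_gb(N, K, 4)   # assumption: 4-byte floats after demeaning

print(raw_gb, float_gb)  # 50.0 200.0
```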

What do you think of the following solutions, which will necessarily require using part of the data from the hard disk rather than from RAM?

1) Store the data in several files, two of which can be loaded into memory at a time. Compute the elements of X'X and X'y by successively loading all possible pairwise combinations of the data sets into memory and multiplying the data matrices. Store the results and use them to assemble X'y and X'X, which is then a 5000 x 5000 matrix that fits into memory and can be inverted by Stata. As I might with some luck have 16 GB of RAM available, this would mean about 7 sub-datasets of about 7 GB each (i.e. about 700 regressors and 10 million observations in each sub-dataset).

2) I read that SAS is able to handle datasets that do not fit into RAM. Does SAS do something like what I described under 1)? Is there a good discussion forum about SAS that might give advice on whether such a large dataset can be handled by the software? (I am sorry to ask this on Statalist...)

3) I have the feeling that with 1) and 2) I am trading the space restriction for a time restriction. Either solution might be so time-consuming that computing least squares estimates from such a huge dataset is simply illusory. Before computing X'X I need to "time-demean" the firm dummies (within transformation), which might also be very time-consuming when there are 5000 regressors and 2 million persons observed over 5 time periods. Presumably the time restriction is real, which is why several authors propose alternative estimation methods for person and firm effects in linked employer-employee data. Do you have any suggestions on how to estimate the time needed?
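To make option 1) and the timing question in 3) more concrete, here is a minimal pure-Python sketch, not a tested implementation for data of this size. It splits X into column blocks, computes X_p'X_q for every unordered pair of blocks, and assembles X'X and X'y from the pieces. All function names are hypothetical, the "files" are just in-memory lists of rows, and `rough_operation_counts` gives dense order-of-magnitude counts that ignore disk I/O and the sparsity of the dummies.

```python
from itertools import combinations_with_replacement


def cross_product(a, b):
    """Return A'B for two blocks stored as lists of rows (same row order)."""
    out = [[0.0] * len(b[0]) for _ in range(len(a[0]))]
    for row_a, row_b in zip(a, b):
        for i, va in enumerate(row_a):
            if va != 0:                      # skip zero cells
                for j, vb in enumerate(row_b):
                    out[i][j] += va * vb
    return out


def assemble_xtx(blocks):
    """Build X'X from column blocks, visiting each unordered pair once."""
    widths = [len(b[0]) for b in blocks]
    offsets = [sum(widths[:i]) for i in range(len(blocks))]
    k = sum(widths)
    xtx = [[0.0] * k for _ in range(k)]
    for p, q in combinations_with_replacement(range(len(blocks)), 2):
        part = cross_product(blocks[p], blocks[q])   # in practice: load two files
        for i, row in enumerate(part):
            for j, v in enumerate(row):
                xtx[offsets[p] + i][offsets[q] + j] = v
                xtx[offsets[q] + j][offsets[p] + i] = v   # fill by symmetry
    return xtx


def assemble_xty(blocks, y):
    """Build X'y by stacking X_p'y over the column blocks."""
    y_col = [[v] for v in y]
    xty = []
    for block in blocks:
        xty.extend(r[0] for r in cross_product(block, y_col))
    return xty


def rough_operation_counts(n, k):
    """Dense multiply-add counts (orders of magnitude only)."""
    return {
        "within_transform": 2 * n * k,         # one pass for sums, one to subtract
        "crossproduct": n * k * (k + 1) // 2,  # upper triangle of X'X
        "solve": k ** 3,                       # order K^3 for the K x K system
    }


# Toy data: 4 observations, 3 regressors, split into blocks of 2 and 1 columns.
x = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 3.0]]
y = [1.0, 2.0, 3.0, 4.0]
blocks = [[row[:2] for row in x], [row[2:] for row in x]]

xtx = assemble_xtx(blocks)
xty = assemble_xty(blocks, y)
counts = rough_operation_counts(10_000_000, 5_000)
```

With 7 column blocks there are 21 distinct pairs of different files to load. Under these dense counts the cross-product step (about 1.25 x 10^14 multiply-adds) dominates both the within transformation (10^11) and the solve (about 1.25 x 10^11). Dividing the counts by a throughput measured on the actual machine would give a first rough time estimate, before adding the cost of reading the files from disk.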

Has anybody had a similar problem or thought this through? I would be glad for any comments.
Regards,
Thomas

-------------------------------------------------------------------------
Thomas Cornelissen
Institute of Empirical Economic Research
University of Hannover, Germany
cornelissen@mbox.iqw.uni-hannover.de