
From: Thomas Cornelißen <cornelissen@mbox.iqw.uni-hannover.de>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: data set larger than RAM
Date: Wed, 2 Nov 2005 16:54:17 +0100

Dear Stata Users,

Suppose you had K = 5000 regressors and N = 10 million observations. This might arise from a linked employer-employee data set in which you explicitly include all firm dummies (and algebraically sweep out the person effects through the within transformation) in order to compute person and firm effects.

As I understand it, Stata SE can handle up to 11 000 variables. In this case, however, the data matrix would take up 50 GB, even assuming that each regressor can be stored as a 1-byte variable.
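As a rough check of the 50 GB figure, the arithmetic can be sketched as below. Python is used purely for illustration. The 1-byte storage per value comes from the text above; the 8 bytes per entry of X'X is an assumption (double precision).

```python
# Back-of-the-envelope storage estimate for the data matrix and for X'X.
n_obs = 10_000_000          # N, from the post
n_regressors = 5_000        # K, from the post
bytes_per_value = 1         # 1-byte variables, as assumed in the post

data_bytes = n_obs * n_regressors * bytes_per_value
data_gb = data_bytes / 10**9            # decimal gigabytes

# Assumption: X'X is held as 8-byte doubles.
bytes_per_double = 8
xtx_bytes = n_regressors ** 2 * bytes_per_double
xtx_mb = xtx_bytes / 10**6              # decimal megabytes

print(data_gb)   # 50.0
print(xtx_mb)    # 200.0
```

So the raw data do not fit into RAM, whereas the 5000 x 5000 cross-product matrix is small by comparison.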

What do you think of the following solutions, each of which will necessarily require using part of the data from the hard disk rather than from RAM:

1) Store the data in several files, any two of which can be loaded into memory at a time. Compute the elements of X'X and X'y by successively loading all possible pairwise combinations of the data sets into memory and multiplying the data matrices. Store the results and use them to assemble X'y and X'X; the latter is then a 5000 x 5000 matrix that fits into memory and can be inverted by Stata. As I might, with some luck, have 16 GB of RAM available, this would mean about 7 sub-datasets of about 7 GB each (i.e. about 700 regressors and 10 million observations per sub-dataset).
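The block-wise assembly described in 1) can be sketched on a toy example as follows. This is only an illustration under stated assumptions: the blocks are small in-memory lists rather than files on disk, and the function names `block_cross_product` and `assemble_xtx` are hypothetical, not part of Stata or of any library.

```python
def block_cross_product(block_a, block_b):
    """Return A'B for two column blocks, each stored as a list of rows."""
    n_cols_a, n_cols_b = len(block_a[0]), len(block_b[0])
    result = [[0.0] * n_cols_b for _ in range(n_cols_a)]
    for row_a, row_b in zip(block_a, block_b):
        for i, a in enumerate(row_a):
            if a == 0:
                continue  # zero entries contribute nothing
            for j, b in enumerate(row_b):
                result[i][j] += a * b
    return result


def assemble_xtx(blocks):
    """Assemble X'X from column blocks, visiting each pair of blocks once."""
    offsets, k = [], 0
    for blk in blocks:
        offsets.append(k)       # first column index of this block in X
        k += len(blk[0])
    xtx = [[0.0] * k for _ in range(k)]
    for p in range(len(blocks)):
        for q in range(p, len(blocks)):
            # In the real setting, blocks p and q would be read from disk here.
            part = block_cross_product(blocks[p], blocks[q])
            for i, row in enumerate(part):
                for j, v in enumerate(row):
                    xtx[offsets[p] + i][offsets[q] + j] = v
                    xtx[offsets[q] + j][offsets[p] + i] = v  # fill by symmetry
    return xtx


# Toy data: 4 observations, 3 regressors split into blocks of 2 and 1 columns.
block_1 = [[1, 0], [0, 1], [1, 1], [0, 0]]
block_2 = [[2], [1], [0], [3]]
xtx = assemble_xtx([block_1, block_2])
print(xtx)
```

With 7 blocks this scheme needs 21 distinct pairs of files plus 7 single-file passes for the diagonal blocks. X'y can be obtained the same way by treating y as one additional single-column block.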

2) I have read that SAS is able to handle datasets that do not fit into RAM. Does SAS do something like what I described under 1)? Is there a good discussion forum about SAS that might give advice on whether such a large dataset can be handled by the software? (I am sorry to ask this on Statalist...)

3) I have the feeling that with 1) and 2) I exchange the space restriction for a time restriction. Either solution, 1) or 2), might be so time-consuming that it is simply unrealistic to compute least squares estimates from such a huge dataset. Before computing X'X I need to "time-demean" the firm dummies (within transformation), which might also be very time-consuming when there are 5000 regressors and 2 million persons observed over 5 time periods. Presumably the time restriction is real, which is why several authors propose alternative estimation methods for person and firm effects in linked employer-employee data. Do you have any suggestions about how to estimate the time needed?
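To make 3) concrete, here is a minimal sketch of the within transformation for a single variable, followed by a crude operation count for forming X'X. The function name `within_transform` is hypothetical, the count treats X as dense (it ignores both the sparsity of the dummies and the symmetry of X'X), and the throughput figure is purely an assumption to be replaced by a measured value for the machine at hand.

```python
from collections import defaultdict


def within_transform(person_ids, values):
    """Subtract each person's mean from that person's observations."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, v in zip(person_ids, values):
        sums[pid] += v
        counts[pid] += 1
    return [v - sums[pid] / counts[pid] for pid, v in zip(person_ids, values)]


# Toy example: person 1 is observed 3 times, person 2 twice.
person_ids = [1, 1, 1, 2, 2]
firm_dummy = [1, 0, 0, 1, 1]
demeaned = within_transform(person_ids, firm_dummy)

# Crude dense upper bound on multiply-adds needed for X'X: N * K^2.
n_obs = 10_000_000
n_regressors = 5_000
dense_ops = n_obs * n_regressors ** 2

# Assumption only: sustained multiply-adds per second on the target machine.
assumed_ops_per_second = 1e9
estimated_hours = dense_ops / assumed_ops_per_second / 3600

print(demeaned)
print(dense_ops, round(estimated_hours, 1))
```

A person who never changes firm, like person 2 here, ends up with demeaned dummies that are all zero, so only movers contribute to the cross products. Timing the computation on a small subsample and scaling it up by the operation count would be one way to calibrate the assumed throughput; disk reads are not covered by this estimate.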

Has anybody had a similar problem or thought this through? I would be glad for any comments.

Regards,

Thomas

-------------------------------------------------------------------------

Thomas Cornelissen

Institute of Empirical Economic Research

University of Hannover, Germany

cornelissen@mbox.iqw.uni-hannover.de

