
From: wgould@stata.com (William Gould, Stata)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: data set larger than RAM
Date: Thu, 10 Nov 2005 12:28:56 -0600

Thomas Cornelissen <cornelissen@mbox.iqw.uni-hannover.de> wrote,

> This is a follow-up on my previous question "st: data set larger than RAM"
> on handling large datasets (in the order of 10 millions of observations).
>
> [...]
>
> I was advised to
>  - use quad precision, for instance quadcross() available in Mata
>  - normalize the variables to mean 0 and variance 1
>  - use solver functionality instead of inverses
>  - take much care and double-check results

Yes, in a nutshell.

In terms of double-checking results, I suggested drawing a random subsample of the data (say 50,000, or 100,000, or 200,000 obs., as convenient) and then using standard Stata commands. The results Thomas produces with the full dataset should be similar to those produced with the subsample.

Hence, Thomas may decide not to do all of the above. He might, for instance, omit normalization. If he finds similar results, then that would be adequate.

> (1) Can I improve on this strategy if I replace quadcross() by doing the
> cross product sums manually using the "mean update rule" that Bill mentioned
> in the previous discussion? Or does quadcross() itself employ the mean
> update rule or take care of the problem by proceeding from the smallest to
> the largest number when summing up?

With a million observations, use quadcross(). It would not be worth the effort to implement the mean-update rule. The advantage of the mean-update rule is savings in memory and, perhaps, computer time. There is a value of N such that, for n>N, the mean-update rule will be more accurate even than quad precision, but not at a mere million observations.

> (2) Should I also normalize dummy variables and categorial variables to mean
> 0 and variance 1 when computing X'X ? (I worry that this changes categorial
> variables from integers to real numbers and therefore increase the memory
> space needed.)

That is not necessary.

Given that you are using quad precision for the matrix sum, you may be able to skip normalization altogether, especially if your variables have reasonable (approximately equal) means. You want to avoid having one variable with a mean of .0001 and another with a mean of 3,960,100 (which is 1990^2, and I have seen people include calendar-year squared in regressions). If all variables have means between 0 and 100, I would not expect difficulty.

-- Bill
wgould@stata.com

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
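
For readers who want to see what the mean-update rule mentioned above looks like in practice, here is a minimal sketch in Python rather than Mata. It assumes the rule refers to the usual running-mean recurrence m_k = m_(k-1) + (x_k - m_(k-1))/k, applied to the products x_i*y_i; that reading is an assumption, not something the message spells out. The function names and the test data are hypothetical, and math.fsum is used only as a stand-in for a higher-precision accumulator. It is not quadcross().

```python
import math


def mean_update(values):
    """Running mean via m_k = m_(k-1) + (x_k - m_(k-1)) / k."""
    mean = 0.0
    for k, x in enumerate(values, start=1):
        mean += (x - mean) / k
    return mean


def cross_product_mean_update(x, y):
    """Sum of x_i * y_i, recovered as n times the running mean of the products."""
    n = len(x)
    return n * mean_update(xi * yi for xi, yi in zip(x, y))


def cross_product_reference(x, y):
    """Accurately rounded sum of the double-precision products (math.fsum)."""
    return math.fsum(xi * yi for xi, yi in zip(x, y))


# Hypothetical data: a calendar-year variable and a regressor with a tiny mean,
# on a subsample-sized run of 200,000 observations.
n_obs = 200_000
years = [1990.0 + (i % 16) for i in range(n_obs)]
small = [0.0001 * ((i % 7) + 1) for i in range(n_obs)]

approx = cross_product_mean_update(years, small)
exact = cross_product_reference(years, small)
rel_err = abs(approx - exact) / abs(exact)
print(f"mean-update: {approx!r}")
print(f"reference:   {exact!r}")
print(f"relative difference: {rel_err:.3e}")
```

The running mean never holds a value much larger than the individual products, which is why it needs only one accumulator per cross product and stays well behaved as n grows. At sample sizes like this one the two results should agree closely, which is consistent with the advice that the extra effort is not worthwhile here.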


© Copyright 1996–2015 StataCorp LP