The Stata listserver

Re: st: data set larger than RAM

From   Thomas Cornelißen <>
To   <>
Subject   Re: st: data set larger than RAM
Date   Thu, 10 Nov 2005 17:01:32 +0100

This is a follow-up to my previous question "st: data set larger than RAM" on handling large datasets (on the order of 10 million observations).

Bill Gould advised me to worry about numerical accuracy when using such large datasets.

If I understood correctly: using the conventional regression tools on such large datasets may lead to inaccurate results, because numerical precision suffers when summing up so many numbers.
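A minimal illustration of the problem, in Python rather than Stata/Mata (the numerical behaviour is the same): a running sum held in single precision silently stops changing once the total outgrows the precision of the type.

```python
import numpy as np

# Single precision carries about 7 decimal digits, so it cannot even
# represent consecutive integers beyond 2^24 -- a naive running sum
# over millions of observations silently stalls or drifts.
s = np.float32(2**24)              # 16,777,216
print(s + np.float32(1) == s)      # True: adding 1 changes nothing

# Double precision is fine at this scale, but with tens of millions of
# terms rounding error still accumulates; quad precision (as used by
# Mata's quadcross()) pushes the breakdown point out much further.
d = np.float64(2**24)
print(d + 1.0 == d)                # False
```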

I was advised to
- use quad precision, for instance quadcross() available in Mata
- normalize the variables to mean 0 and variance 1
- use solver functionality instead of inverses
- take much care and double-check results
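The third point above can be sketched in a few lines. This is Python/NumPy with a small made-up design matrix, purely to illustrate the principle: solve the normal equations (X'X) b = X'y with a linear solver rather than forming the explicit inverse, which is slower and less accurate on ill-conditioned problems.

```python
import numpy as np

# Hypothetical small regression problem for illustration only.
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(n)

XtX = X.T @ X
Xty = X.T @ y

b_inverse = np.linalg.inv(XtX) @ Xty   # discouraged: explicit inverse
b_solve = np.linalg.solve(XtX, Xty)    # preferred: factorize and solve

print(b_solve)                         # close to beta_true
```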

I have two follow-up questions about this:
(1) Can I improve on this strategy if I replace quadcross() by computing the cross-product sums manually with the "mean update rule" that Bill mentioned in the previous discussion? Or does quadcross() itself employ the mean update rule, or handle the problem by summing from the smallest to the largest number?
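For reference, a sketch of what a "mean update rule" for a running mean and centered cross product looks like, in the spirit of Welford's online algorithm. Python is used here only for illustration; in Stata this would live in Mata. Whether quadcross() uses such an update internally is exactly the open question; what is known is that it accumulates in quad precision.

```python
def online_mean_crossprod(pairs):
    """Running means of x and y, plus the sum of centered cross
    products, updated one observation at a time (Welford form)."""
    n = 0
    mx = my = cxy = 0.0
    for x, y in pairs:
        n += 1
        dx = x - mx            # deviation from the *old* mean of x
        mx += dx / n           # mean update rule: m_n = m_{n-1} + dx/n
        my += (y - my) / n
        cxy += dx * (y - my)   # old dx times deviation from *new* my
    return mx, my, cxy

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(online_mean_crossprod(data))   # (2.0, 4.0, 4.0)
```

The update avoids ever forming a huge raw sum like Σxy, which is where catastrophic rounding happens; each step only adds a small, centered increment.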

(2) Should I also normalize dummy variables and categorical variables to mean 0 and variance 1 when computing X'X? (I worry that this changes categorical variables from integers to real numbers and therefore increases the memory needed.)
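The memory worry in (2) is real, as a quick NumPy sketch shows (again Python for illustration; in Stata terms a byte variable would have to become a double to hold centered values):

```python
import numpy as np

# A 0/1 dummy stored as 1-byte integers grows eightfold once centered,
# because the centered values are no longer integers.
d = np.zeros(10_000_000, dtype=np.int8)
d[:3_000_000] = 1                 # hypothetical indicator, 30% ones

centered = d - d.mean()           # promoted to float64
print(d.nbytes, centered.nbytes)  # 10000000 80000000
```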

I would be grateful for further comments or advice on these issues.

Thomas Cornelissen
Institute of Empirical Economic Research
University of Hannover, Germany
