This is a follow-up on my previous question "st: data set larger than RAM"
on handling large datasets (on the order of 10 million observations).
Bill Gould advised me to worry about numerical accuracy when working with
such large datasets.
If I understood it correctly: conventional regression tools may give
inaccurate results on such large datasets because of numerical precision
problems when summing so many numbers (the running sum grows large relative
to each new term, so low-order digits get lost).
I was advised to
- use quad precision, for instance quadcross() available in Mata (see the
  sketch after this list)
- normalize the variables to mean 0 and variance 1
- use solver functionality instead of computing inverses
- take great care and double-check the results
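To make sure I have understood the advice, this is roughly the skeleton I
have in mind in Mata (the variable names y, x1, x2, x3 are just placeholders
for my own data; the normalization step is left out here):

    mata:
        y  = st_data(., "y")
        X  = st_data(., "x1 x2 x3")
        XX = quadcross(X, 1, X, 1)    // (X,1)'(X,1) accumulated in quad precision
        Xy = quadcross(X, 1, y, 0)    // (X,1)'y accumulated in quad precision
        b  = cholsolve(XX, Xy)        // solve X'X b = X'y rather than invert X'X
        b                             // last element is the constant
    end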
I have two follow-up questions about this:
(1) Can I improve on this strategy by replacing quadcross() with
cross-product sums computed manually using the "mean update rule" that Bill
mentioned in the previous discussion? Or does quadcross() itself employ the
mean update rule, or take care of the problem by summing from the smallest
to the largest number?
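For clarity, this is what I understand by the mean update rule for a single
variable (my own sketch, not code from the earlier thread):

    mata:
        // running mean: m_k = m_{k-1} + (x_k - m_{k-1})/k, so the accumulator
        // stays on the scale of the data instead of growing like a raw sum
        real scalar mean_update(real colvector x)
        {
            real scalar m, k

            m = 0
            for (k = 1; k <= rows(x); k++) {
                m = m + (x[k] - m)/k
            }
            return(m)
        }
    end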
(2) Should I also normalize dummy variables and categorical variables to
mean 0 and variance 1 when computing X'X? (I worry that this turns variables
currently stored as integers into real numbers and therefore increases the
memory needed.)
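To make the storage-type worry concrete, this is the kind of thing I mean
(d1 is a placeholder name for a 0/1 dummy currently stored as byte):

    summarize d1
    replace d1 = (d1 - r(mean)) / r(sd)    // the byte dummy is promoted to float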
I would be grateful for any further comments or advice on these issues.
Thomas