Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: Mata - generalised cross products of the form X'SX

 From Gordon Hughes To statalist@hsphsun2.harvard.edu Subject Re: st: Mata - generalised cross products of the form X'SX Date Wed, 09 Feb 2011 10:40:10 +0000

To save space I haven't reflected all of Bill Gould's response to my original posting, but I would like to pick up on the question of optimisation with respect to different ways of doing the same thing. I understand that hardware, the nature of the problem, etc matter, but others may be interested in what I have found. As background I am running Stata/MP for 2 cores under Windows XP (32 bit) and Macintosh OS-X 10.6.6 (64 bit). The machines are respectively Intel Quad Core (XP) and Intel 2 Core (Mac) with 4 GB of RAM in each case. I run into memory limitations under XP and time limitations on the Mac. My research problem is spatial panel econometrics involving quite large N and T, but my test runs hit the constraints at sample sizes much smaller than I need for real work. For spatial panel data, the calculations are structured as the sum over T of what are, in effect, GLS estimations over panel units so nrows = ncols in many cases because of the repeated application of spatial weights.

I have implemented most of the optimisations by hand including calculating X'B'BX as T'*T where T=B*X. Trying to go further:

1. I replaced --T'*T -- by --cross(T,T)--. There is no significant reduction in execution time. For a majority of cases execution takes longer with cross(), but this is not always the case. My experience is that the overhead for small problems is greater than the increase in speed for large problems, so that the gain does not justify the potential penalty.

2. Going further, suppose S is positive definite. Then, I found that applying the Cholesky factorisation S=LL' and using the variant X'SX=T'*T where T=L'*X pays off if you have to do the factorisation for other purposes (e.g. prior inversion) but is any time gains are very marginal otherwise.

3. Suppose S is square but not guaranteed to be positive definite (unfortunately my case), then the LU factorisation S=PLU means that X'SX=(X'*PL)*(U*X). My experience in this case is that there is a very definite time penalty from using this formulation rather than just using X'*S*X.

Unfortunately without the ability to monitor memory use I don't know whether --cross()-- or some of the other formulations are significantly more efficient in memory use. For anyone wondering what this is all about, the largest of my test problems involves up to 4,500 observations for 225 panel units and 20 time periods. My real data is up to 500 panel units and 50 or more time periods. The largest problem takes 4 hours to run on my Mac and the time required for a run goes up exponentially with the number of time periods - going from 8 to 20 increases execution time by more 50 times. The program spends nearly 80% of execution time in one function, so that there are huge gains from optimising this function, but I suspect that I have got to the point where I need to go back to using much larger distributed computing power.

Gordon Hughes
g.a.hughes@ed.ac.uk

Date: Tue, 08 Feb 2011 09:37:14 -0600
From: wgould@stata.com (William Gould, StataCorp LP)
Subject: Re: st: Mata - generalised cross products of the form X'SX

Gordon Hughes <G.A.Hughes@ed.ac.uk> writes,

> B.  Roughly, under what conditions might this sequence of steps
>     reduce execution time and/or memory use on the assumption
>     that the dimension of S is large relative to cols(X)?

I'm sorry, I don't have an answer, except to say that the answer will
depend on whether you are using Stata/MP because MP is parallelizes
matrix multiplication, and thus performs it faster than does
Stata/SE.  In cases were rows>>cols or cols>>rows, multiplication
time approaches theoretical limits, which is to say, execution time
becomes proportional to 1/cores.

> In a related context I have to calculate X'B'BX where B is square
> but not symmetric and found that use of --cross()-- does not save
> much (if any) time, though it is probably more efficient on memory.

Sometimes the Mata optimizer is not as smart as we wish that it were.
To calculate X'B'B'X as quickly as possible, code

T = B'X
R = T'T

That will execute more quickly than R = X'B'B'X.  When you code
X'B'B'X, Mata does not notice that X'B'B'x = (B'X)'(B'X).  Calculation
of X'B'B'X results in Mata proforming three transposes and four matrix
multiplications.
....

In addition, use parenthesis where necessary to force the order of the
operations.  Concerning order of operations, Mata does not know what
you know, namely that rows >> cols.  Mata makes all of its
optimization decisions at compile time, and therefore can can choose
inefficient order of operations.  In this case, for memory
effinciency reasons, you do not want to calculate B'B to get to
X'B'B'X, you want to calculate X'B and B'X.  That right there will
make a differece in execution time, even without exploiting X'B
= (B'X)', which can be used additionally to save an entire multiplication.

- -- Bill
wgould@stata.com
*

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/