# Re: RE: st: Efficiency of the colon operator in Mata

 From wgould@stata.com (William Gould, Stata) To statalist@hsphsun2.harvard.edu Subject Re: RE: st: Efficiency of the colon operator in Mata Date Tue, 23 May 2006 09:24:16 -0500

```I responded too quickly to Ben Jann's <ben.jann@soz.gess.ethz.ch> question
about the performance of colon-multiplication vs. multiplication.
Alax Ogan <aogan@ArrowstreetCapital.com> reports,

>       . mata: x = uniform(10000,1)
>       r; t=0.00 16:21:51
>
>       . mata: for (i=1;i<=10000;i++) y = x*1
>       r; t=1.47 16:21:53
>
>       . mata: for (i=1;i<=10000;i++) y = x:*1
>       r; t=3.33 16:21:57

and I, too, have reproduced similar results on my computer.

I now believe that colon operators could be sped up -- perhaps by 50 percent --
if we did a lot of work.  Let me explain.

We did go to a lot of work with matrix multiplication.  It is not so much that
colon operators are slow, but that they are being compared to matrix
multiplicaiton, which we have made especially fast.

How one codes matrix multiplication was a real education for me.  One thinks
of matrix multiplication as being pretty simple.  Given A: k x m and B: m x n,
one forms C = A*B by

m
A[i,j] =   Sum A[i,w]*B[w,j] ,     i = 1, ..., k and j = 1, ..., n
w=1

Or, in computer speak

for (i=1; i<=k; i++) {
for (j=1; j<=n; j++) {
C[i,j] = 0
for (w=1; w<=m; w++) {
C[i,j] = C[i,j] + A[i,w]*B[w,j]
}
}
}

However, I could just as well reverse the i and j loops:

for (j=1; j<=n; j++) {
for (i=1; i<=k; i++) {
C[i,j] = 0
for (w=1; w<=m; w++) {
C[i,j] = C[i,j] + A[i,w]*A[w,j]
}
}
}

In fact, there are 6 possible ways I could have written the code,

1.  loop over  i, then j, then k
2.  loop over  i, then k, then j
3.  loop over  j, then i, then k
4.  loop over  j, then k, then i
5.  loop over  k, then i, then j
6.  loop over  k, then j, then i

What does it matter, you say to yourself, they are all mathematically
equivalent.

It turns out that with today's modern computers, how you organize the loop
affects what the floating-point coprocessor will be able to keep in its
registers from one loop to the next, and it matters because it affects what
the CPU will be able to keep in cache.

So which organization is best?  Answer:  it depends.  If you really want to
write a fast matrix multiplier, you need to write separate routines for lots
of special cases, such as k>m vs. k<m, m>n vs. m<n, and so on.  In each case,
there is a different best solution.

As I said, there are 6 solutions.  Then, if you really want to be fast, you
need to write six versions of the code for (1) a real times a real, for a
complex times a complex, for (3) a complex times a real, for (4) a real times
a complex, for (5) a view times a view, and on and on.

We didn't write all of them, but we wrote a lot of them, and as a result,

We followed a different programming strategy for the colon operators:
General-purpose code, easy to read, easy to understand, easy to certify,
easy to maintain.  We did that for two reasons:

1.  There are lots of colon operators.

2.  Colon operators are, compared to matrix multiplication, a
light-weight operation.  When you sum total run times, they
usually will not amount to much.

Point (2) is worth emphasizing.  Consider multiplying A: k x m and B: m x n.
One performs k*n*m operations in obtaining the final result.  To make results
for multiplication conformable with colon-multiplication, let's assume k==m.
Matrix multiplication requires (k^2)*m operations.  Colon-multiplication
requires only k*m operations.

There are 7 indexing cases for colon operators:

A             B
----------------------
r x c         r x c

r x 1         r x c
1 x c         r x c
1 x 1         r x c

r x c         r x 1
r x c         1 x c
r x c         1 x 1
----------------------

We used one routine for all of them.

In addition, another we we simplified the problem was to use to use the same
code for all 13 of the colon operators, and called a subroutine in the
innermost loop.

Thus, to improve performance, we need either 7 specialized routines, or
we need 7*13 = 91.

We could hope for a 56 percent improvement in the A: 1 x 1 case.

Still, the absolute difference will be small.  In Olag's case, the
excess time was ((3.33-1.47)/10000)/10000 = 1.86e-08 seconds per element.  I
don't know about Ben's real application, but even if he were doing the
operation on 100,000-element vectors 300 times, the excess amount only to
.558 seconds.

We will look into it.

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```