Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Mata slow?


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Mata slow?
Date   Tue, 28 Nov 2006 09:38:31 -0600

Brendan Halpin <brendan.halpin@ul.ie> explained that the Mata code we have
been discussing is to be called _N*(_N-1)/2 times, so speed is important.

To refresh your memories, Brendan wrote code that ran in .03906 seconds, 
which he noted was about 40 times slower than a plugin he wrote in C.
I worked on the code on got the execution time down to .0228 seconds, 
or about 22 times slower than the plugin.

Motivated by Brendan's latest comment, I have now improved the program so that
it runs in .003 seconds.  That's just 2 times slower than the plugin, and we
have reduced the run time to just 7.5% of the original.

Here's my latest version (with timing and replication):


    --------------------------------------------------------------- t.do ---
    clear   

    local N 100

    mata:

    void tryit()
    {
            timer_clear()                 // clear timers

            A = B = C = D = J(`N'+1, `N'+1, 0)
            s1 = s2 = J(`N', 1, 0)

            a = J(3, `N', .)

            for (k=1; k<=50; k++) {
                    timer_on(1)           // start timer 1
                    /* --------------------------------------- CODE --- */
                    n1 = length(s1) + 1
                    n2 = length(s2) + 1

                    J = (2\n2)
                    J1= J :- 1

                    for (i=2; i<=n1; i++) {
                            I     = (i\i)
                            I1    = I :- 1

                             I_J  = (I, J)
                             I_J1 = (I, J1)
                            I1_J  = (I1, J)
                            I1_J1 = (I1, J1)

                            a[1,] = D[|  I_J1 |] + A[| I_J |]
                            a[2,] = D[| I1_J  |] + B[| I_J |]
                            a[3,] = D[| I1_J1 |] + C[| I_J |]

                            D[| I_J |] = colmin(a)
                    }
                    /* --------------------------------------- CODE --- */
                    timer_off(1)         // stop timer 1
            }
            timer(1)                     // report timer 1
    }

    tryit()
    end
    --------------------------------------------------------------- t.do ---

What I did was substitute vector operations for Brendan's inner loop.
Brendan's inner loop went across j=2 to n2.  That corresponds to J in 
the code above.  Brendan needed to refer to j1 = j-1; that's J1 in the 
latest code.

Brendan's inner loop was to set 

         D[i,j] = min( D[i, j1] + A[i,j], 
                       D[i1,j ] + B[i,j], 
                       D[i1,j1] + C[i,j] )

for i=2 to n1 and j=2 to n2 (and where i1 = i-1 and j1 = j-1).

In my code above, 

        1st row of a    corresponds to    D[i, j1] + A[i,j]   (for j=2 to n2)
        2nd row of a    corresponds to    D[i1,j ] + B[i,j]   (for j=2 to n2)
        3rd row of a    corresponds to    D[i1,j1] + C[i,j]   (for j=2 to n2)

for a given i.  Thus, 

        D[i,j]  =  colmin(a)   (for j=2 to n2, and for a given i)

I used range subscripts; see -help m2 subscripts-.  I could not just refer to
D[i,], A[i,], etc. because j was not going over all values of j.  j goes from
2 to n2.  And notice that I did *NOT* code D[i, (2..n2)], etc.  The use of
list subscripts in such cases is slower than range subscripts because (2..n2)
requires the vector (2, 3, ..., n2) be allocated and filled in, and remember,
Brendan told us these matrices could be large.

In the latest code, you are probably wondering why I predefined

            a = J(3, `N', .)

and then later coded, 

            a[1,] = D[|  I_J1 |] + A[| I_J |]
            a[2,] = D[| I1_J  |] + B[| I_J |]
            a[3,] = D[| I1_J1 |] + C[| I_J |]

Rather than simply coding 

            a = ( D[|  I_J1 |] + A[| I_J |] \
                  D[| I1_J  |] + B[| I_J |] \
                  D[| I1_J1 |] + C[| I_J |]      )

The answer is speed.  Brendan told us that these matrices (and therefore
vectors) are potentially large and, in such cases, it is faster to preallocate
and fill in rather than construct on the fly; see "Warning about the 
misuse of the comma and backslash operators" in -help m2 op_join-.

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index