Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Weird problem with nested functions in Mata


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Weird problem with nested functions in Mata
Date   Mon, 17 Apr 2006 11:22:26 -0500

Ben Jann <ben.jann@soz.gess.ethz.ch> reports some suprising timings.
He runs the following, 

        ---test.do---------------------------
        set rmsg on
        mata:
        real colvector test1(x) return(x:*x:*x:*x:*x)
        real colvector test2(x) return(test1(x))
        real colvector test3(x) return(test2(x))
        real colvector test4(x) return(test3(x))
        real colvector test5(x) return(test4(x))
        real colvector test6(x) return(test5(x))
        end

        mata: x = uniform(10000,1)
        mata: for (i=1;i<=1000;i++) {; z = x:*x:*x:*x:*x; }
        mata: for (i=1;i<=1000;i++) {; z = test1(x); }
        mata: for (i=1;i<=1000;i++) {; z = test2(x); }
        mata: for (i=1;i<=1000;i++) {; z = test3(x); }
        mata: for (i=1;i<=1000;i++) {; z = test4(x); }
        mata: for (i=1;i<=1000;i++) {; z = test5(x); }
        mata: for (i=1;i<=1000;i++) {; z = test6(x); }
        --------------------------------------

on a number of different computers.  One some of them, he observed spikes in
run times of test1(), test3(), and test5().  On his Intel Centrino 1.3Ghz
running Windows XP SP2, the spike is 1.73 seconds (.00173 seconds per call),
and amounts to a 78% increase in run time.  All of the computers in which the
spikes were observed were running Windows except for one, which was running
Linux, and on that computer, the spike was observed in only one of the tests.

He summarizes, 

> So the problem somehow seems to depend on OS and/or CPU,
> although I cannot really make sense of it.

Alan Riley (ariley@strata.com) and I (wgould@stata.com) have now looked 
into this problem.  Our findings are, 

     0.  What Ben found turns out to be fascinating, but is not a big deal.

     1.  Despite appearances, the observed spikes are not a function of 
         nesting level.  They are a function of CPU+coprocessor State S.
         The time-cost of a nesting level we estimate to be .00000025
         seconds.

     2.  In the case of the results produced by Ben, S was a function 
         of nesting level N.

     3.  In the example provided, S(N) could improve things, or make them 
         worse, as N increased.  Moreover, even without changing nesting
         level, adding code could change the observed behavior.

     4.  The spikes observed by Ben occur only with :*, :+, etc. -- the colon 
         operators.  The spikes are not observed with, for instance, 
         matrix multiplication or with eigenvalue extraction.

     5.  We now have a good understanding of the issue (further explained
         below), and based on that, we believe that there are other functions
         in Mata that could exhibit the same behavior, although we do not 
         know which they are.

     6.  The behavior occurs only with certain 32-bit Intel computers.  It is 
         a hardware issue.  Our tests indicate that it does *NOT* occur 
         with modern 32-bit Intel computers.  It also does *NOT* occur 
         with 64-bit computers.

     7.  A change in the logic of the colon operators can eliminate the 
         behavior and even make the colon operators run faster!  However, that
         same change slows down 64-bit computers by a factor of 4, and
         probably slows down modern 32-bit computers, too, although we have
         not verified that yet.


So what is going on?
--------------------

Our first reaction to this report, made in an even earlier email by Ben, was
to dismiss it.  Modern computers are interrupted constantly, because of
network messages, because someone touches the mouse, because a housekeeping
project needs to run, etc., and we put the results down to that.  What 
caught our attention in Ben's most recent post, however, was the
reproducibility of it.  Ben keeps emphasizing that it is test3() and test5(),
and random interruptions do not explain that.

We did timings for the execution time cost of nesting level (reported below), 
and immediately dismissed that nesting level itself could be the cause.
Even if nesting tripled or quadrupled in time for some unknown reason, 
so little time is spent in subroutine linkage that it could not explain 
the relatively large timing differences that Ben reported.

As we mentioned above, we did find an older 32-bit computer, this one running
Linux, that exhibited the bevahior Ben reported.  When we ran Ben's tests
and substituted matrix multiplication, or eigen value extraction, the spikes 
vanished.

That lead us to looking in detail at the colon-operator code.  We discovered 
that if we made the slightest change to the code we could change the patterns 
of the spikes.  We could make spikes occur every third, every fourth, every 
fifth, and we could create other interesting patterns as well.  We identified 
the single line of the C code in the implementation of colon operators
which could cause all of the above.

That line of the code involves passing to a subroutine two double-precision
values, and we discovered that the slighest change to how those two values
were handled led to large changes in run times and patterns.  Basically, the
hardware was sometimes determining that the values were already on the
coprocessor stack (which they were), and sometimes losing that information,
and sometimes even getting more confused than that.  It all depended on what
was executed on the corprocessor previously.

We did lots of experiments and we will not report them all here.  The bottom
line is that we discoved that if we changed our call to pass the addresses of
the two double-precision values rather than the values themselves, the
colon-operator ran nearly twice as fast, and the pattern vanished!

We emphasize, this is true only on older 32-bit Intel hardware.  When we made
the same change on modern, 64-bit hardware, execution times quadrupled!

So there is no magic fix.  We still need to do experiments on modern 
32-bit CPUs, and non-Intel CPUs, to determine if, in all 32-bit cases, 
we can change our code and not harm anyone.  


Nesting level timings
---------------------

This is really a footnote.  Attached are two do-files, timing.do and an.do.
timing.do produces timings that are analyzed by an.do

We ran these do-files twice, once on a modern Linux computer, and again on 
a modern Windows computer.

To summarize our results, a nesting level costs about .25*10^-6 seconds.
Said differently, a nesting level requires .00000025 sec.  Said differently,
you would need to nest 4-million deep to cause an increase of 1 second in 
execution time.  

That's on our computers, which are about average:  Intel 2.8 gigahertz P4s.

A run of the analysis is provided below.

-- Bill                      -- Alan 
   wgould@stata.com             ariley@stata.com


Attachements:

     1.  Run of an.do
     2.  timing.do
     3.  an.do

==============================================================================
Attachment 1:  Run of an.do

-------------------------------------------------------------------------------
       log:  /home/wwg/email/an.log
  log type:  text
 opened on:  14 Apr 2006, 10:10:25
r; t=0.00 10:10:25

. 
. program addmeans
  1.         syntax varlist 
  2.         set obs 4
  3.         foreach name of local varlist {
  4.                 qui sum `name'
  5.                 qui replace `name' = r(mean) in l
  6.         }
  7. end
r; t=0.00 10:10:25

. 
. 
. 
. infile using linux, clear 

dictionary {
        double top0
        double top1
        double top2
        double top3
        double top4
        double top5
        double top6
}

(3 observations read)
r; t=0.00 10:10:25

. format %9.2f top0-top6
r; t=0.00 10:10:25

. addmeans _all
obs was 3, now 4
r; t=0.00 10:10:25

. gen str os = "linux"
r; t=0.00 10:10:25

. order os
r; t=0.00 10:10:25

. list

     +------------------------------------------------------------+
     |    os   top0   top1   top2    top3    top4    top5    top6 |
     |------------------------------------------------------------|
  1. | linux   1.91   5.00   7.78   10.34   12.84   15.38   17.92 |
  2. | linux   1.87   4.91   7.71   10.34   12.79   15.32   17.85 |
  3. | linux   1.90   4.88   7.73   10.35   12.85   15.35   17.92 |
  4. | linux   1.89   4.93   7.74   10.34   12.83   15.35   17.90 |
     +------------------------------------------------------------+
r; t=0.00 10:10:25

. save linux, replace 
file linux.dta saved
r; t=0.00 10:10:25

. keep in l
(3 observations deleted)
r; t=0.00 10:10:25

. save avg_linux, replace
file avg_linux.dta saved
r; t=0.00 10:10:25

. 
. 
. infile using windows, clear

dictionary {
        double top0
        double top1
        double top2
        double top3
        double top4
        double top5
        double top6
}

(3 observations read)
r; t=0.00 10:10:25

. format %9.2f top0-top6
r; t=0.00 10:10:25

. addmeans _all
obs was 3, now 4
r; t=0.00 10:10:25

. gen str os = "win"
r; t=0.00 10:10:25

. order os 
r; t=0.00 10:10:25

. list

     +---------------------------------------------------------+
     |  os   top0   top1   top2   top3    top4    top5    top6 |
     |---------------------------------------------------------|
  1. | win   2.05   4.16   7.13   9.56   11.69   13.86   16.09 |
  2. | win   1.98   4.17   7.17   9.67   11.67   13.84   16.08 |
  3. | win   2.02   4.14   7.08   9.61   11.69   13.83   16.08 |
  4. | win   2.02   4.16   7.13   9.61   11.68   13.84   16.08 |
     +---------------------------------------------------------+
r; t=0.00 10:10:25

. save win, replace 
file win.dta saved
r; t=0.00 10:10:25

. keep in l 
(3 observations deleted)
r; t=0.00 10:10:25

. save avg_win, replace
file avg_win.dta saved
r; t=0.00 10:10:25

. 
. append using avg_linux
os was str3 now str5
r; t=0.00 10:10:25

. sort os 
r; t=0.00 10:10:25

. list

     +------------------------------------------------------------+
     |    os   top0   top1   top2    top3    top4    top5    top6 |
     |------------------------------------------------------------|
  1. | linux   1.89   4.93   7.74   10.34   12.83   15.35   17.90 |
  2. |   win   2.02   4.16   7.13    9.61   11.68   13.84   16.08 |
     +------------------------------------------------------------+
r; t=0.00 10:10:25

. 
. gen double t1 = ((top1-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t2 = ((top2-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t3 = ((top3-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t4 = ((top4-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t5 = ((top5-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t6 = ((top6-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. format t1-t6 %9.3f
r; t=0.00 10:10:25

. list os t1-t6

     +-------------------------------------------------------+
     |    os      t1      t2      t3      t4      t5      t6 |
     |-------------------------------------------------------|
  1. | linux   0.304   0.585   0.845   1.093   1.346   1.600 |
  2. |   win   0.214   0.511   0.760   0.967   1.183   1.407 |
     +-------------------------------------------------------+
r; t=0.00 10:10:25

. gen double dt2 = ((top2-top1)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double dt3 = ((top3-top2)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double dt4 = ((top4-top3)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double dt5 = ((top5-top4)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double dt6 = ((top6-top5)/1e+7)*1e+6
r; t=0.00 10:10:25

. format dt2-dt6 %9.3f
r; t=0.00 10:10:25

. list os t1 dt2-dt6

     +-------------------------------------------------------+
     |    os      t1     dt2     dt3     dt4     dt5     dt6 |
     |-------------------------------------------------------|
  1. | linux   0.304   0.281   0.260   0.248   0.252   0.255 |
  2. |   win   0.214   0.297   0.249   0.207   0.216   0.224 |
     +-------------------------------------------------------+
r; t=0.00 10:10:25

. log close
       log:  /home/wwg/email/an.log
  log type:  text
 closed on:  14 Apr 2006, 10:10:25
-------------------------------------------------------------------------------

==============================================================================
Attachment 2:  timing.do

cscript

log using junk1.log, replace
set more off
set rmsg on 


mata:
mata:
real colvector test1(x) return(x)
real colvector test2(x) return(test1(x))
real colvector test3(x) return(test2(x))
real colvector test4(x) return(test3(x))
real colvector test5(x) return(test4(x))
real colvector test6(x) return(test5(x))

void top0()
{
        real scalar        i
        real colvector        x

        x = (1)
        for (i=1; i<=10000000; i++) {
                /* do nothing */
        }
}

void top1()
{
        real scalar        i
        real colvector        x

        x = (1)
        for (i=1; i<=10000000; i++) {
                (void) test1(x)
        }
}


void top2()
{
        real scalar        i
        real colvector        x

        x = (1)
        for (i=1; i<=10000000; i++) {
                (void) test2(x)
        }
}


void top3()
{
        real scalar        i
        real colvector        x

        x = (1)
        for (i=1; i<=10000000; i++) {
                (void) test3(x)
        }
}

void top4()
{
        real scalar        i
        real colvector        x

        x = (1)
        for (i=1; i<=10000000; i++) {
                (void) test4(x)
        }
}

void top5()
{
        real scalar        i
        real colvector        x

        x = (1)
        for (i=1; i<=10000000; i++) {
                (void) test5(x)
        }
}

void top6()
{
        real scalar        i
        real colvector        x

        x = (1)
        for (i=1; i<=10000000; i++) {
                (void) test6(x)
        }
}

end
mata:  top0()
mata:  top0()
mata:  top0()
mata:  top1()
mata:  top1()
mata:  top1()
mata:  top2()
mata:  top2()
mata:  top2()
mata:  top3()
mata:  top3()
mata:  top3()
mata:  top4()
mata:  top4()
mata:  top4()
mata:  top5()
mata:  top5()
mata:  top5()
mata:  top6()
mata:  top6()
mata:  top6()
log close

==============================================================================
Attachment 3:  an.do
(Note, to run this, you must type in the timings from timing.do
For the Linux system, they were

        dictionary {
                double top0
                double top1
                double top2
                double top3
                double top4
                double top5
                double top6
        }
        1.91 5.00 7.78 10.34 12.84 15.38 17.92
        1.87 4.91 7.71 10.34 12.79 15.32 17.85
        1.90 4.88 7.73 10.35 12.85 15.35 17.92

and for the Windows system, they were

        dictionary {
                double top0
                double top1
                double top2
                double top3
                double top4
                double top5
                double top6
        }
        2.05 4.16 7.13  9.56 11.69 13.86 16.09
        1.98 4.17 7.17  9.67 11.67 13.84 16.08
        2.02 4.14 7.08  9.61 11.69 13.83 16.08

The do file is:

clear
program drop _all

log using an.log, replace

program addmeans
        syntax varlist 
        set obs 4
        foreach name of local varlist {
                qui sum `name'
                qui replace `name' = r(mean) in l
        }
end




infile using linux, clear 
format %9.2f top0-top6
addmeans _all
gen str os = "linux"
order os
list
save linux, replace 
keep in l
save avg_linux, replace


infile using windows, clear
format %9.2f top0-top6
addmeans _all
gen str os = "win"
order os 
list
save win, replace 
keep in l 
save avg_win, replace

append using avg_linux
sort os 
list

gen double t1 = ((top1-top0)/1e+7)*1e+6
gen double t2 = ((top2-top0)/1e+7)*1e+6
gen double t3 = ((top3-top0)/1e+7)*1e+6
gen double t4 = ((top4-top0)/1e+7)*1e+6
gen double t5 = ((top5-top0)/1e+7)*1e+6
gen double t6 = ((top6-top0)/1e+7)*1e+6
format t1-t6 %9.3f
list os t1-t6
gen double dt2 = ((top2-top1)/1e+7)*1e+6
gen double dt3 = ((top3-top2)/1e+7)*1e+6
gen double dt4 = ((top4-top3)/1e+7)*1e+6
gen double dt5 = ((top5-top4)/1e+7)*1e+6
gen double dt6 = ((top6-top5)/1e+7)*1e+6
format dt2-dt6 %9.3f
list os t1 dt2-dt6
log close
==============================================================================
<end>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index