# Re: st: Weird problem with nested functions in Mata

 From [email protected] (William Gould, Stata) To [email protected] Subject Re: st: Weird problem with nested functions in Mata Date Mon, 17 Apr 2006 11:22:26 -0500

```Ben Jann <[email protected]> reports some suprising timings.
He runs the following,

---test.do---------------------------
set rmsg on
mata:
real colvector test1(x) return(x:*x:*x:*x:*x)
real colvector test2(x) return(test1(x))
real colvector test3(x) return(test2(x))
real colvector test4(x) return(test3(x))
real colvector test5(x) return(test4(x))
real colvector test6(x) return(test5(x))
end

mata: x = uniform(10000,1)
mata: for (i=1;i<=1000;i++) {; z = x:*x:*x:*x:*x; }
mata: for (i=1;i<=1000;i++) {; z = test1(x); }
mata: for (i=1;i<=1000;i++) {; z = test2(x); }
mata: for (i=1;i<=1000;i++) {; z = test3(x); }
mata: for (i=1;i<=1000;i++) {; z = test4(x); }
mata: for (i=1;i<=1000;i++) {; z = test5(x); }
mata: for (i=1;i<=1000;i++) {; z = test6(x); }
--------------------------------------

on a number of different computers.  One some of them, he observed spikes in
run times of test1(), test3(), and test5().  On his Intel Centrino 1.3Ghz
running Windows XP SP2, the spike is 1.73 seconds (.00173 seconds per call),
and amounts to a 78% increase in run time.  All of the computers in which the
spikes were observed were running Windows except for one, which was running
Linux, and on that computer, the spike was observed in only one of the tests.

He summarizes,

> So the problem somehow seems to depend on OS and/or CPU,
> although I cannot really make sense of it.

Alan Riley ([email protected]) and I ([email protected]) have now looked
into this problem.  Our findings are,

0.  What Ben found turns out to be fascinating, but is not a big deal.

1.  Despite appearances, the observed spikes are not a function of
nesting level.  They are a function of CPU+coprocessor State S.
The time-cost of a nesting level we estimate to be .00000025
seconds.

2.  In the case of the results produced by Ben, S was a function
of nesting level N.

3.  In the example provided, S(N) could improve things, or make them
worse, as N increased.  Moreover, even without changing nesting
level, adding code could change the observed behavior.

4.  The spikes observed by Ben occur only with :*, :+, etc. -- the colon
operators.  The spikes are not observed with, for instance,
matrix multiplication or with eigenvalue extraction.

5.  We now have a good understanding of the issue (further explained
below), and based on that, we believe that there are other functions
in Mata that could exhibit the same behavior, although we do not
know which they are.

6.  The behavior occurs only with certain 32-bit Intel computers.  It is
a hardware issue.  Our tests indicate that it does *NOT* occur
with modern 32-bit Intel computers.  It also does *NOT* occur
with 64-bit computers.

7.  A change in the logic of the colon operators can eliminate the
behavior and even make the colon operators run faster!  However, that
same change slows down 64-bit computers by a factor of 4, and
probably slows down modern 32-bit computers, too, although we have
not verified that yet.

So what is going on?
--------------------

Our first reaction to this report, made in an even earlier email by Ben, was
to dismiss it.  Modern computers are interrupted constantly, because of
network messages, because someone touches the mouse, because a housekeeping
project needs to run, etc., and we put the results down to that.  What
caught our attention in Ben's most recent post, however, was the
reproducibility of it.  Ben keeps emphasizing that it is test3() and test5(),
and random interruptions do not explain that.

We did timings for the execution time cost of nesting level (reported below),
and immediately dismissed that nesting level itself could be the cause.
Even if nesting tripled or quadrupled in time for some unknown reason,
so little time is spent in subroutine linkage that it could not explain
the relatively large timing differences that Ben reported.

As we mentioned above, we did find an older 32-bit computer, this one running
Linux, that exhibited the bevahior Ben reported.  When we ran Ben's tests
and substituted matrix multiplication, or eigen value extraction, the spikes
vanished.

That lead us to looking in detail at the colon-operator code.  We discovered
that if we made the slightest change to the code we could change the patterns
of the spikes.  We could make spikes occur every third, every fourth, every
fifth, and we could create other interesting patterns as well.  We identified
the single line of the C code in the implementation of colon operators
which could cause all of the above.

That line of the code involves passing to a subroutine two double-precision
values, and we discovered that the slighest change to how those two values
were handled led to large changes in run times and patterns.  Basically, the
hardware was sometimes determining that the values were already on the
coprocessor stack (which they were), and sometimes losing that information,
and sometimes even getting more confused than that.  It all depended on what
was executed on the corprocessor previously.

We did lots of experiments and we will not report them all here.  The bottom
line is that we discoved that if we changed our call to pass the addresses of
the two double-precision values rather than the values themselves, the
colon-operator ran nearly twice as fast, and the pattern vanished!

We emphasize, this is true only on older 32-bit Intel hardware.  When we made
the same change on modern, 64-bit hardware, execution times quadrupled!

So there is no magic fix.  We still need to do experiments on modern
32-bit CPUs, and non-Intel CPUs, to determine if, in all 32-bit cases,
we can change our code and not harm anyone.

Nesting level timings
---------------------

This is really a footnote.  Attached are two do-files, timing.do and an.do.
timing.do produces timings that are analyzed by an.do

We ran these do-files twice, once on a modern Linux computer, and again on
a modern Windows computer.

To summarize our results, a nesting level costs about .25*10^-6 seconds.
Said differently, a nesting level requires .00000025 sec.  Said differently,
you would need to nest 4-million deep to cause an increase of 1 second in
execution time.

That's on our computers, which are about average:  Intel 2.8 gigahertz P4s.

A run of the analysis is provided below.

-- Bill                      -- Alan
[email protected]             [email protected]

Attachements:

1.  Run of an.do
2.  timing.do
3.  an.do

==============================================================================
Attachment 1:  Run of an.do

-------------------------------------------------------------------------------
log:  /home/wwg/email/an.log
log type:  text
opened on:  14 Apr 2006, 10:10:25
r; t=0.00 10:10:25

.
1.         syntax varlist
2.         set obs 4
3.         foreach name of local varlist {
4.                 qui sum `name'
5.                 qui replace `name' = r(mean) in l
6.         }
7. end
r; t=0.00 10:10:25

.
.
.
. infile using linux, clear

dictionary {
double top0
double top1
double top2
double top3
double top4
double top5
double top6
}

r; t=0.00 10:10:25

. format %9.2f top0-top6
r; t=0.00 10:10:25

obs was 3, now 4
r; t=0.00 10:10:25

. gen str os = "linux"
r; t=0.00 10:10:25

. order os
r; t=0.00 10:10:25

. list

+------------------------------------------------------------+
|    os   top0   top1   top2    top3    top4    top5    top6 |
|------------------------------------------------------------|
1. | linux   1.91   5.00   7.78   10.34   12.84   15.38   17.92 |
2. | linux   1.87   4.91   7.71   10.34   12.79   15.32   17.85 |
3. | linux   1.90   4.88   7.73   10.35   12.85   15.35   17.92 |
4. | linux   1.89   4.93   7.74   10.34   12.83   15.35   17.90 |
+------------------------------------------------------------+
r; t=0.00 10:10:25

. save linux, replace
file linux.dta saved
r; t=0.00 10:10:25

. keep in l
(3 observations deleted)
r; t=0.00 10:10:25

. save avg_linux, replace
file avg_linux.dta saved
r; t=0.00 10:10:25

.
.
. infile using windows, clear

dictionary {
double top0
double top1
double top2
double top3
double top4
double top5
double top6
}

r; t=0.00 10:10:25

. format %9.2f top0-top6
r; t=0.00 10:10:25

obs was 3, now 4
r; t=0.00 10:10:25

. gen str os = "win"
r; t=0.00 10:10:25

. order os
r; t=0.00 10:10:25

. list

+---------------------------------------------------------+
|  os   top0   top1   top2   top3    top4    top5    top6 |
|---------------------------------------------------------|
1. | win   2.05   4.16   7.13   9.56   11.69   13.86   16.09 |
2. | win   1.98   4.17   7.17   9.67   11.67   13.84   16.08 |
3. | win   2.02   4.14   7.08   9.61   11.69   13.83   16.08 |
4. | win   2.02   4.16   7.13   9.61   11.68   13.84   16.08 |
+---------------------------------------------------------+
r; t=0.00 10:10:25

. save win, replace
file win.dta saved
r; t=0.00 10:10:25

. keep in l
(3 observations deleted)
r; t=0.00 10:10:25

. save avg_win, replace
file avg_win.dta saved
r; t=0.00 10:10:25

.
. append using avg_linux
os was str3 now str5
r; t=0.00 10:10:25

. sort os
r; t=0.00 10:10:25

. list

+------------------------------------------------------------+
|    os   top0   top1   top2    top3    top4    top5    top6 |
|------------------------------------------------------------|
1. | linux   1.89   4.93   7.74   10.34   12.83   15.35   17.90 |
2. |   win   2.02   4.16   7.13    9.61   11.68   13.84   16.08 |
+------------------------------------------------------------+
r; t=0.00 10:10:25

.
. gen double t1 = ((top1-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t2 = ((top2-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t3 = ((top3-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t4 = ((top4-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t5 = ((top5-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double t6 = ((top6-top0)/1e+7)*1e+6
r; t=0.00 10:10:25

. format t1-t6 %9.3f
r; t=0.00 10:10:25

. list os t1-t6

+-------------------------------------------------------+
|    os      t1      t2      t3      t4      t5      t6 |
|-------------------------------------------------------|
1. | linux   0.304   0.585   0.845   1.093   1.346   1.600 |
2. |   win   0.214   0.511   0.760   0.967   1.183   1.407 |
+-------------------------------------------------------+
r; t=0.00 10:10:25

. gen double dt2 = ((top2-top1)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double dt3 = ((top3-top2)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double dt4 = ((top4-top3)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double dt5 = ((top5-top4)/1e+7)*1e+6
r; t=0.00 10:10:25

. gen double dt6 = ((top6-top5)/1e+7)*1e+6
r; t=0.00 10:10:25

. format dt2-dt6 %9.3f
r; t=0.00 10:10:25

. list os t1 dt2-dt6

+-------------------------------------------------------+
|    os      t1     dt2     dt3     dt4     dt5     dt6 |
|-------------------------------------------------------|
1. | linux   0.304   0.281   0.260   0.248   0.252   0.255 |
2. |   win   0.214   0.297   0.249   0.207   0.216   0.224 |
+-------------------------------------------------------+
r; t=0.00 10:10:25

. log close
log:  /home/wwg/email/an.log
log type:  text
closed on:  14 Apr 2006, 10:10:25
-------------------------------------------------------------------------------

==============================================================================
Attachment 2:  timing.do

cscript

log using junk1.log, replace
set more off
set rmsg on

mata:
mata:
real colvector test1(x) return(x)
real colvector test2(x) return(test1(x))
real colvector test3(x) return(test2(x))
real colvector test4(x) return(test3(x))
real colvector test5(x) return(test4(x))
real colvector test6(x) return(test5(x))

void top0()
{
real scalar        i
real colvector        x

x = (1)
for (i=1; i<=10000000; i++) {
/* do nothing */
}
}

void top1()
{
real scalar        i
real colvector        x

x = (1)
for (i=1; i<=10000000; i++) {
(void) test1(x)
}
}

void top2()
{
real scalar        i
real colvector        x

x = (1)
for (i=1; i<=10000000; i++) {
(void) test2(x)
}
}

void top3()
{
real scalar        i
real colvector        x

x = (1)
for (i=1; i<=10000000; i++) {
(void) test3(x)
}
}

void top4()
{
real scalar        i
real colvector        x

x = (1)
for (i=1; i<=10000000; i++) {
(void) test4(x)
}
}

void top5()
{
real scalar        i
real colvector        x

x = (1)
for (i=1; i<=10000000; i++) {
(void) test5(x)
}
}

void top6()
{
real scalar        i
real colvector        x

x = (1)
for (i=1; i<=10000000; i++) {
(void) test6(x)
}
}

end
mata:  top0()
mata:  top0()
mata:  top0()
mata:  top1()
mata:  top1()
mata:  top1()
mata:  top2()
mata:  top2()
mata:  top2()
mata:  top3()
mata:  top3()
mata:  top3()
mata:  top4()
mata:  top4()
mata:  top4()
mata:  top5()
mata:  top5()
mata:  top5()
mata:  top6()
mata:  top6()
mata:  top6()
log close

==============================================================================
Attachment 3:  an.do
(Note, to run this, you must type in the timings from timing.do
For the Linux system, they were

dictionary {
double top0
double top1
double top2
double top3
double top4
double top5
double top6
}
1.91 5.00 7.78 10.34 12.84 15.38 17.92
1.87 4.91 7.71 10.34 12.79 15.32 17.85
1.90 4.88 7.73 10.35 12.85 15.35 17.92

and for the Windows system, they were

dictionary {
double top0
double top1
double top2
double top3
double top4
double top5
double top6
}
2.05 4.16 7.13  9.56 11.69 13.86 16.09
1.98 4.17 7.17  9.67 11.67 13.84 16.08
2.02 4.14 7.08  9.61 11.69 13.83 16.08

The do file is:

clear
program drop _all

log using an.log, replace

syntax varlist
set obs 4
foreach name of local varlist {
qui sum `name'
qui replace `name' = r(mean) in l
}
end

infile using linux, clear
format %9.2f top0-top6
gen str os = "linux"
order os
list
save linux, replace
keep in l
save avg_linux, replace

infile using windows, clear
format %9.2f top0-top6
gen str os = "win"
order os
list
save win, replace
keep in l
save avg_win, replace

append using avg_linux
sort os
list

gen double t1 = ((top1-top0)/1e+7)*1e+6
gen double t2 = ((top2-top0)/1e+7)*1e+6
gen double t3 = ((top3-top0)/1e+7)*1e+6
gen double t4 = ((top4-top0)/1e+7)*1e+6
gen double t5 = ((top5-top0)/1e+7)*1e+6
gen double t6 = ((top6-top0)/1e+7)*1e+6
format t1-t6 %9.3f
list os t1-t6
gen double dt2 = ((top2-top1)/1e+7)*1e+6
gen double dt3 = ((top3-top2)/1e+7)*1e+6
gen double dt4 = ((top4-top3)/1e+7)*1e+6
gen double dt5 = ((top5-top4)/1e+7)*1e+6
gen double dt6 = ((top6-top5)/1e+7)*1e+6
format dt2-dt6 %9.3f
list os t1 dt2-dt6
log close
==============================================================================
<end>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```

• Follow-Ups: