st: puzzling benchmark results for MV probit
Kenneth Flamm <email@example.com>
Wed, 01 Aug 2007 23:03:15 -0500
I recently put together a quad-core Core 2 Quad Q6600 machine for the express purpose of running multivariate probit problems, and similar types of code, in Stata.
My thought was that I would modify and run the multivariate probit code making use of the MVNP plugin described by Cappellari and Jenkins in 'Calculation of Multivariate Normal Probabilities by Simulation, with Applications to Maximum Simulated Likelihood Estimation', The Stata Journal, 6(2), pp. 156-189. In addition to being faster than the Cappellari and Jenkins MVPROBIT routine, the ML code using MVNP looked relatively easy to modify in order to specify starting values for parameter estimates, rather than being required to use the results of initial single-equation probits as starting values, as currently seems to be the case with MVPROBIT. (I have basically zero experience programming Stata subroutines, and am reluctant to try to learn enough to modify the ado code. The code in Cappellari and Jenkins, 2006, looks much easier to make small modifications to, with the Stata programming manual in hand.)
To get some idea of what I could expect, I ran the code for illustration 2 (in C&J, 2006), downloadable example test_mc_mvp3.do, on the following configurations of hardware:
An older dual-core Athlon 64 X2 4200 running at 2.53 GHz (modestly overclocked), running the 2-processor versions of Stata 9 MP and Stata 10 MP. SiSoftware Sandra memory benchmarks show this machine having bandwidth of about 4.7-4.8 GB/sec and latency of 93 ns.
A quad-core Intel Q6600 running at 2.4 GHz (stock speed), running Stata 10 MP 2 (using only 2 of the 4 cores) and Stata 10 MP 4. SiSoftware Sandra memory benchmarks show this machine having bandwidth of about 5.8 GB/sec and latency of 83 ns.
A homebrew Intel Core 2 Duo E4300 overclocked to 2.52 GHz, with only single-channel DDR memory. SiSoftware Sandra memory benchmarks show this machine having bandwidth of about 3.5 GB/sec and latency of 114 ns.
All of the above machines have 2 GB of total memory.
The timer built into the example code gives the following elapsed times:
                          Mdraws                MV Probit     MVPROBIT
                          250 antithetic draws  by ML Code
Ath 64 X2
  Stata 9 MP 2 cores        2.89                1434.39       2199.74
  Stata 10 MP 2 cores       2.48                1401.97       2171.92
Intel Core 2 Quad Q6600
  Stata 10 MP 4 cores        .92                1806.88       1011.02
  Stata 10 MP 2 cores       1.03                1806.02       1699.41
Intel Core 2 Duo E4300
  Stata 10 MP 2 cores       1.14                1726.55       3790.0
The Athlon 64 X2 seems to run the ML code with the MVNP plugin significantly faster than either of the Intel Core 2 machines. Disappointingly, there seems to be no speedup at all on the Q6600 going from 2 to 4 cores. The slightly faster run on the E4300 than on the Q6600 is probably related to the slightly faster (overclocked) clock rate on the E4300.
On MVPROBIT, the quad core, with either 2 or 4 cores, is faster than the Athlon, even though the Athlon runs at a slightly higher clock rate. Going from 2 to 4 cores drops run times by about 40%. Memory bandwidth probably plays an important role in explaining performance on MVPROBIT; the E4300 takes more than double the time of 2 Q6600 cores running at a slightly slower clock. MVPROBIT actually runs faster than ML with the MVNP plugin on my quad core with either 2 or 4 cores. Not so on the E4300, which leads me to believe that a large cache size must be needed to enable MVPROBIT to run faster with more cores. Each of the 2 pairs of cores on the Q6600 has 4 MB of cache, the E4300 has 2 MB of cache, and the Athlon has 1 MB (2 x 512 KB) of total cache.
Mdraws shows a very slight speedup with more cores. The Athlon takes more than twice as long on this as either of the Intel machines, so this is not being driven by fast or slow memory. My suspicion is that cache size is driving Mdraws performance.
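The ratios quoted in the last three paragraphs can be rechecked directly from the table. The following is only a small Python sketch of that arithmetic; the dictionary keys and the `ratio` helper are names made up for illustration, and the numbers are simply the elapsed times reported above.

```python
# Elapsed times copied from the table in this post (as reported by the
# example's built-in timer). Keys are (machine, Stata version, cores);
# the key and benchmark names are illustrative labels, not Stata names.
TIMES = {
    ("Athlon 64 X2", "Stata 9 MP", 2):  {"mdraws": 2.89, "ml_mvnp": 1434.39, "mvprobit": 2199.74},
    ("Athlon 64 X2", "Stata 10 MP", 2): {"mdraws": 2.48, "ml_mvnp": 1401.97, "mvprobit": 2171.92},
    ("Q6600", "Stata 10 MP", 4):        {"mdraws": 0.92, "ml_mvnp": 1806.88, "mvprobit": 1011.02},
    ("Q6600", "Stata 10 MP", 2):        {"mdraws": 1.03, "ml_mvnp": 1806.02, "mvprobit": 1699.41},
    ("E4300", "Stata 10 MP", 2):        {"mdraws": 1.14, "ml_mvnp": 1726.55, "mvprobit": 3790.0},
}

A10 = ("Athlon 64 X2", "Stata 10 MP", 2)
Q4 = ("Q6600", "Stata 10 MP", 4)
Q2 = ("Q6600", "Stata 10 MP", 2)
E2 = ("E4300", "Stata 10 MP", 2)


def ratio(slow_key, fast_key, benchmark):
    """How many times longer slow_key takes than fast_key on one benchmark."""
    return TIMES[slow_key][benchmark] / TIMES[fast_key][benchmark]


# Speedup on the Q6600 when going from 2 to 4 cores, per benchmark.
for bench in ("mdraws", "ml_mvnp", "mvprobit"):
    print(f"Q6600 2 -> 4 cores, {bench}: {ratio(Q2, Q4, bench):.3f}x")

# Fractional drop in MVPROBIT run time from 2 to 4 cores on the Q6600.
mvprobit_drop = 1 - TIMES[Q4]["mvprobit"] / TIMES[Q2]["mvprobit"]
print(f"MVPROBIT run-time reduction: {mvprobit_drop:.1%}")

# Cross-machine comparisons mentioned in the text.
print(f"E4300 vs Q6600 (2 cores), MVPROBIT: {ratio(E2, Q2, 'mvprobit'):.2f}x")
print(f"Athlon vs E4300, Mdraws: {ratio(A10, E2, 'mdraws'):.2f}x")
```

Keeping the timings in one table-like structure makes it easy to add further configurations later without redoing the comparisons by hand.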
Why is there no speedup with more cores with MVNP and ML? Why does my old Athlon run MVNP faster than my new Q6600? Is the MVNP plugin written or compiled in a manner that precludes it from making use of multiple threads on more than a single core? (If so, this is a significant drawback to the current version of the program.) Why does the Athlon do significantly better on the ML with MVNP version of the benchmark, but significantly worse with everything else? (Cache size?) (Is the Athlon floating point math better, when MVNP constrains the problem to run on only one processor?)
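One rough way to frame the core-scaling question is Amdahl's law. The Python sketch below assumes run time follows T(n) = T(1) * ((1 - p) + p/n) for a parallelizable fraction p, which is an idealization and not something the timings themselves establish; it ignores cache and memory effects entirely. The function name `parallel_fraction` is made up for illustration. Under that assumption, the 2-core and 4-core Q6600 timings imply a value of p for each benchmark.

```python
def parallel_fraction(t_small, t_large, n_small=2, n_large=4):
    """Estimate the parallelizable fraction p under Amdahl's law.

    Assumes T(n) = T(1) * ((1 - p) + p / n). Given run times at n_small and
    n_large cores, the observed speedup r = t_small / t_large satisfies
        r = (1 - p * a) / (1 - p * b),  a = 1 - 1/n_small,  b = 1 - 1/n_large,
    which rearranges to p = (r - 1) / (r * b - a).
    The result is clamped to [0, 1] because timing noise can push it outside.
    """
    r = t_small / t_large
    a = 1 - 1 / n_small
    b = 1 - 1 / n_large
    p = (r - 1) / (r * b - a)
    return min(max(p, 0.0), 1.0)


# Q6600 timings from the table: (2-core time, 4-core time).
q6600 = {
    "Mdraws": (1.03, 0.92),
    "ML with MVNP": (1806.02, 1806.88),
    "MVPROBIT": (1699.41, 1011.02),
}

estimates = {name: parallel_fraction(t2, t4) for name, (t2, t4) in q6600.items()}
for name, p in estimates.items():
    print(f"{name}: implied parallel fraction ~ {p:.2f}")
```

Read this way, an implied fraction near zero for ML with MVNP would be consistent with the plugin effectively running on a single core, though these numbers alone cannot distinguish that from other bottlenecks.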
With a multicore machine with large cache and fast memory, it would appear that the older MVPROBIT is actually a faster method than ML/MVNP with plugin!
In any event, my plan of building a machine optimized to run these things faster clearly needs tuning. Any insights as to what is going on and what would be an optimal configuration for running this type of problem would be greatly appreciated.
University of Texas at Austin