The following answer is based on an exchange that started on
Statalist.
What kind of performance increase can I expect in going from 32-bit to
64-bit Stata?
| Title |
|
Difference in performance between 32-bit Stata and 64-bit Stata |
| Author |
Kerry Kammire, StataCorp
Bill Gould, StataCorp |
| Date |
February 2006 |
We know that going from a 1-GHz CPU to a 2-GHz CPU will give us roughly
twice the speed. So why doesn’t going from a 32-bit CPU to a 64-bit CPU
give us the same performance boost?
Bill Gould posted the following explanation to Statalist:
There have recently been a number of postings about the speed of 32-bit
Stata and 64-bit Stata, summarized by Alan Riley:
32-bit 64-bit
-----------------------------------------
sort 2.00 1.64 seconds
regress .88 .80
-----------------------------------------
timings from 1.4 GHz AMD Opteron 240
running 64-bit Windows.
Some of you are probably wondering why the numbers in the second column are
not 1.00 and .44. After all, 64 is twice 32, so shouldn’t the run
times be half? Or at least approximately half?
We all know how to interpret clock-cycle speeds. A 1.4-GHz computer runs
twice as fast as a 0.7-GHz computer. Why don’t bit widths work this
way?
Computer width
Microprocessors have grown from being 4 bit, to 8, to 16, to 32, and now, to
64. What does this mean?
One way to picture a computer is as a machine with gears and a crank
sticking out of the side:
+--== <- crank
+---------------------+ |
| || || || || | |
shaft -> +---------------------+----+
| || || || || |
+---------------------+
||
The -- is one gear on a shaft.
||
The picture above illustrates a 4-bit computer.
The number of gears corresponds to the width of the computer. The rate at
which you turn the crank corresponds to the clock speed. (On a 1.4-GHz
computer, you turn the crank 1.4 billion times per second.)
On this 4-bit computer, every turn of the crank does something to 4 bits. If
we doubled the number of gears—made an 8-bit computer—every turn
of the crank would do something to 8 bits.
Completing the picture
Our picture is not yet complete because we must add memory, so here is the
completed picture:
<--- computer, which slides left and right ---->
+--==
+---------------------+ |
| || || || || | |
+---------------------+----+ (cpu)
|| || || ||
=============================================================
... || || || || || || || || || || ...
-------------------------------------------------------- (memory)
... || || || || || || || || || || ...
------------------------------------------------------------------
In this picture, the computer slides left and right on a rail (illustrated
by ====). Below the rail is memory, another set of gears, with which the
gears on the computer mesh.
Putting the computer to work
Let us say we want to do a calculation on a 4-bit number.
To do this, we line up the CPU with appropriate position in memory—the
position that contains the 4-bit number on which we want to
operate—and we turn the crank.
Working with numbers longer than 4 bits
Let us say we want to do a calculation on an 8-bit number. With our
4-bit computer, the following is a slight oversimplification, but not by
much: we line up our computer on the first 4 bits of the 8-bit number, turn
the crank, shift our computer right 4 bits, and turn the crank again.
This example leads to the following general rule:
R1. If we have a k-bit–wide computer and performing an operation
on a k-bit number requires n turns of the crank, then
performing the operation on an m > k bit number requires
(m/k)*n turns of the crank, plus m/k realignments.
R1 applies only when m/k is an integer ≥1.
Working with numbers shorter than 4 bits
Let us say we want to do a calculation on a 2-bit number. With a real
computer, such problems arise often. On a 32-bit computer, we may want to
make a calculation on 8-bit values (i.e., 1-byte values, or characters) or
16-bit values (i.e., 2-byte values, or short integers).
On a 64-bit computer, we may want to make calculations on 8-bit values,
16-bit values, or 32-bit values (i.e., 4-byte values, or integers).
However, our example computer cannot work on values shorter than 4 bits
because it has four gears and each meshes with memory.
We need to take a detour and talk about alignment.
Alignment
I drew my illustration pretty carefully. Look again:
+--==
+---------------------+ |
| || || || || | |
+---------------------+----+
|| || || ||
=============================================================
... || || || || || || || || || || ...
--------------------------------------------------------
... || || || || || || || || || || ...
--------------------------------------------------------------
... 2 3 4 5 6 7 8 9 10 11 ...
Below the illustration, I have numbered the gears. Think of those as memory
positions.
The memory occurs in 4-bit groups on this 4-bit computer.
As our computer slides left and right along memory, it can align only at
certain places. We can line up our CPU on gear 4, or gear 8, etc., but not
on, say, gear 3:
+--==
+---------------------+ |
| || || || || | |
+---------------------+----+
|| || || ||
=============================================================
... || || || || || || || || || || ...
--------------------------------------------------------
... || || || || || || || || || || ...
---------------------------------------------------------------
... 2 3 4 5 6 7 8 9 10 11 ...
Above I aligned the leftmost gear of our CPU with gear 3 of memory, and now
the remaining gears do not mesh.
Modern computers are like that. It is a rule of construction.
Back to working with numbers shorter than 4 bits
Let us draw just the memory part of our computer:
=============================================================
... || || || || || || || || || || ...
--------------------------------------------------------
... || || || || || || || || || || ...
---------------------------------------------------------------
... 2 3 4 5 6 7 8 9 10 11 ...
Remember, we are going to make a 2-bit calculation. The two bits, without
loss of generality, could be at (3,4), (4,5), (6,7), or (7,8).
Actually, because of the alignment problem I described above, positions
(3,4), (5,6), and (7,8) are not possible. Our computer cannot do that.
That leaves (4,5) and (6,7).
Let us consider the two cases.
The bits are at gears (6,7)
We align our CPU on bit 4, so that we have
+--==
+---------------------+ |
| || || || || | |
+---------------------+----+
|| || || ||
=============================================================
... || || || || || || || || || || ...
--------------------------------------------------------
... || || || || || || || || || || ...
---------------------------------------------------------------
... 2 3 4 5 6 7 8 9 10 11 ...
Then we do the following:
-
We turn the crank once. That loads bits (4,5,6,7) into our
CPU.
-
We throw a switch on the crank and turn it again, which
clears the first two gears on our CPU.
We are then ready to make the calculation.
The bits are at gears (4,5)
We start with the same alignment: we align our CPU on gear 4.
+--==
+---------------------+ |
| || || || || | |
+---------------------+----+
|| || || ||
=============================================================
... || || || || || || || || || || ...
--------------------------------------------------------
... || || || || || || || || || || ...
---------------------------------------------------------------
... 2 3 4 5 6 7 8 9 10 11 ...
Then we do the following:
- We turn the crank once. That loads bits (4,5,6,7) into our
CPU.
- We throw another switch on the crank and turn the crank again,
which copies gears (4,5) to (5,6).
- We throw the original switch and turn the crank again, and
that clears gears 4 and 5.
We are now ready to make the calculation.
There are other cases, too
Other cases do not arise on our 4-bit computer doing 2-bit calculations but
might arise if we considered 1-bit calculations. The bit of interest might
not be on the left or on the right but in the middle.
You can work that one out for yourself. It requires yet another turn of
crank.
Summary, working with numbers shorter than 4 bits
R2. If we have a k-bit–wide computer and performing an operation
on a k-bit number requires n turns of the crank, then
performing the operation on an m < k bit number requires
n+2, n+3, or n+4 turns of the crank, depending on where the
number is located.
R2 applies only when k/m is an integer >1.
Surprise!
When clock speed is held constant, 64-bit computers are not faster than
32-bit computers at everything!
Say you want to do a calculation on a 32-bit number that requires n
turns of the crank on a 32-bit computer. That calculation will require n+2,
n+3, or n+4 turns of the crank on a 64-bit computer.
With modern computers, what we want to do can often be done in one or two
turns of the crank. On a 64-bit computer, the overhead for such
calculations is enormous.
So now let us consider the problem carefully, and let me add some more
information:
1. One reason to have a k-bit computer is that you want
to perform calculations on m ≥ k quantities.
For 64-bit computers and Stata, that concept corresponds to
to double-precision numbers. Stata does many such
calculations.
64-bit computers will perform calculations faster on 64-bit
quantities.
2. 64-bit computers will perform calculations slower on 32-bit
quantities. Thirty-two-bit quantities arise all the time in programs.
They are used, for instance, as loop counters and indexes.
However, some 64-bit computers are faster than others when
working with short quantities. In our illustration,
I showed a k-bit computer as having its memory organized in k-bit
groups. The computer does not have to be designed that
way, although it adds considerable complication to the design
if the manufacturer relaxes the constraint.
Intel and AMD relax the constraint, which saves many turns
of the crank and alleviates many of the disadvantages of
making the computer wider.
Finally, 64-bit computers can address more memory, but my illustration does
not demonstrate that capacity.
|