Title  Difference in performance between 32bit Stata and 64bit Stata  
Author 
Kerry Kammire, StataCorp Bill Gould, StataCorp 

Date  February 2006 
We know that going from a 1GHz CPU to a 2GHz CPU will give us roughly twice the speed. So why doesn’t going from a 32bit CPU to a 64bit CPU give us the same performance boost?
Bill Gould posted the following explanation to Statalist:
There have recently been a number of postings about the speed of 32bit Stata and 64bit Stata, summarized by Alan Riley:
32bit 64bit  sort 2.00 1.64 seconds regress .88 .80  timings from 1.4 GHz AMD Opteron 240 running 64bit Windows.
Some of you are probably wondering why the numbers in the second column are not 1.00 and .44. After all, 64 is twice 32, so shouldn’t the run times be half? Or at least approximately half?
We all know how to interpret clockcycle speeds. A 1.4GHz computer runs twice as fast as a 0.7GHz computer. Why don’t bit widths work this way?
Microprocessors have grown from being 4 bit, to 8, to 16, to 32, and now, to 64. What does this mean?
One way to picture a computer is as a machine with gears and a crank sticking out of the side:
+== < crank ++         shaft > +++       ++  The  is one gear on a shaft. 
The picture above illustrates a 4bit computer.
The number of gears corresponds to the width of the computer. The rate at which you turn the crank corresponds to the clock speed. (On a 1.4GHz computer, you turn the crank 1.4 billion times per second.)
On this 4bit computer, every turn of the crank does something to 4 bits. If we doubled the number of gears—made an 8bit computer—every turn of the crank would do something to 8 bits.
Our picture is not yet complete because we must add memory, so here is the completed picture:
< computer, which slides left and right > +== ++         +++ (cpu)     ============================================================= ...           ...  (memory) ...           ... 
In this picture, the computer slides left and right on a rail (illustrated by ====). Below the rail is memory, another set of gears, with which the gears on the computer mesh.
Let us say we want to do a calculation on a 4bit number.
To do this, we line up the CPU with appropriate position in memory—the position that contains the 4bit number on which we want to operate—and we turn the crank.
Let us say we want to do a calculation on an 8bit number. With our 4bit computer, the following is a slight oversimplification, but not by much: we line up our computer on the first 4 bits of the 8bit number, turn the crank, shift our computer right 4 bits, and turn the crank again.
This example leads to the following general rule:
R1. If we have a kbit–wide computer and performing an operation on a kbit number requires n turns of the crank, then performing the operation on an m > k bit number requires (m/k)*n turns of the crank, plus m/k realignments.
R1 applies only when m/k is an integer ≥1.
Let us say we want to do a calculation on a 2bit number. With a real computer, such problems arise often. On a 32bit computer, we may want to make a calculation on 8bit values (i.e., 1byte values, or characters) or 16bit values (i.e., 2byte values, or short integers).
On a 64bit computer, we may want to make calculations on 8bit values,
16bit values, or 32bit values (i.e., 4byte values, or integers).
However, our example computer cannot work on values shorter than 4 bits because it has four gears and each meshes with memory.
We need to take a detour and talk about alignment.
I drew my illustration pretty carefully. Look again:
+== ++         +++     ============================================================= ...           ...  ...           ...  ... 2 3 4 5 6 7 8 9 10 11 ...
Below the illustration, I have numbered the gears. Think of those as memory positions.
The memory occurs in 4bit groups on this 4bit computer.
As our computer slides left and right along memory, it can align only at certain places. We can line up our CPU on gear 4, or gear 8, etc., but not on, say, gear 3:
+== ++         +++     ============================================================= ...           ...  ...           ...  ... 2 3 4 5 6 7 8 9 10 11 ...
Above I aligned the leftmost gear of our CPU with gear 3 of memory, and now the remaining gears do not mesh.
Modern computers are like that. It is a rule of construction.
Let us draw just the memory part of our computer:
============================================================= ...           ...  ...           ...  ... 2 3 4 5 6 7 8 9 10 11 ...
Remember, we are going to make a 2bit calculation. The two bits, without loss of generality, could be at (3,4), (4,5), (6,7), or (7,8).
Actually, because of the alignment problem I described above, positions (3,4), (5,6), and (7,8) are not possible. Our computer cannot do that. That leaves (4,5) and (6,7).
Let us consider the two cases.
We align our CPU on bit 4, so that we have
+== ++         +++     ============================================================= ...           ...  ...           ...  ... 2 3 4 5 6 7 8 9 10 11 ...
Then we do the following:
We are then ready to make the calculation.
We start with the same alignment: we align our CPU on gear 4.
+== ++         +++     ============================================================= ...           ...  ...           ...  ... 2 3 4 5 6 7 8 9 10 11 ...
Then we do the following:
We are now ready to make the calculation.
Other cases do not arise on our 4bit computer doing 2bit calculations but might arise if we considered 1bit calculations. The bit of interest might not be on the left or on the right but in the middle.
You can work that one out for yourself. It requires yet another turn of crank.
R2. If we have a kbit–wide computer and performing an operation on a kbit number requires n turns of the crank, then performing the operation on an m < k bit number requires n+2, n+3, or n+4 turns of the crank, depending on where the number is located.
R2 applies only when k/m is an integer >1.
When clock speed is held constant, 64bit computers are not faster than 32bit computers at everything!
Say you want to do a calculation on a 32bit number that requires n turns of the crank on a 32bit computer. That calculation will require n+2, n+3, or n+4 turns of the crank on a 64bit computer.
With modern computers, what we want to do can often be done in one or two turns of the crank. On a 64bit computer, the overhead for such calculations is enormous.
So now let us consider the problem carefully, and let me add some more information:
1. One reason to have a kbit computer is that you want to perform calculations on m ≥ k quantities.
For 64bit computers and Stata, that concept corresponds to to doubleprecision numbers. Stata does many such calculations.
64bit computers will perform calculations faster on 64bit quantities.
2. 64bit computers will perform calculations slower on 32bit quantities. Thirtytwobit quantities arise all the time in programs. They are used, for instance, as loop counters and indexes.
However, some 64bit computers are faster than others when working with short quantities. In our illustration, I showed a kbit computer as having its memory organized in kbit groups. The computer does not have to be designed that way, although it adds considerable complication to the design if the manufacturer relaxes the constraint.
Intel and AMD relax the constraint, which saves many turns of the crank and alleviates many of the disadvantages of making the computer wider.
Finally, 64bit computers can address more memory, but my illustration does not demonstrate that capacity.