Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Overriding a loop if 0 observations using tabstat


From   vwiggins@stata.com (Vince Wiggins, StataCorp)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Overriding a loop if 0 observations using tabstat
Date   Wed, 28 Apr 2010 16:27:32 -0500

Short question
--------------

In a thread that started out about looping then became about relative speed
with differing amounts of memory, Richard Goldstein <richgold@ix.netcom.com>,
Robert Picard <picard@netbox.com>, and Kit Baum <baum@bc.edu> have written
about substantial differences in runtimes when performing many iterations of
some very simple computations on a very small dataset.  They attribute the
difference in runtimes to differing amounts of memory allocated to Stata.


Short answer
------------

First, I believe their timings.  Second, I am not concerned.

There is nothing to worry about here.  These timings are an artifact of a
toy-size example combined with the way modern computers handle cache memory.
You will see timing differences of this sort only on small datasets and when
when doing easy computations that are performed repeatedly -- Robert, Rich,
and Kit (RRK) are computing means and counts over a 100,000 observation
dataset comprised of a single variable.  The only reason they observe much
timing passing at all is that they are iterating the task 10,000 times.


Long question
-------------

The task that RRK are timing came from Nick Cox <n.j.cox@durham.ac.uk> 
(who at the time was merely trying to show that -count- was faster than
-summarize, meanonly- and that both were faster than -summarize-.)  Here is
the do file.

    -------------------------------- BEGIN --- timings.do --- CUT HERE -------
    clear
    set obs 100000
    set seed 2803
    gen y = runiform()
    set rmsg on
    
    qui forval i = 1/10000 {
            count if y > 0.5
    }
    
    qui forval i = 1/10000 {
            summarize y if y > 0.5, meanonly
    }

    qui forval i = 1/10000 {
            summarize y if y > 0.5
    }
    --------------------------------   END --- timings.do --- CUT HERE -------

RRK are timing each of the loops -- the count loop, the summarize meanonly
loop, and the full summarize loop.  They are performing the timings with
different amounts of memory allocated to Stata.

Here are some of the timings reported by RRK (rounded to seconds).

                               summarize
            memory    count    meanonly   summarize
    ---------------+-------------------------------
    Rich     10 MB |    6         50          56
    Rich    500 MB |    6         99         104
                   |
    Robert   10 MB |   11         43          47
    Robert 1000 MB |   17         64          71
                   |
    Kit     500 MB |   14         61          72

    Kit did not report timings for other memory configurations, but
    surprisingly noted that reducing the memory allocated to Stata increased
    the runtimes -- the opposite of Rich and Robert

Robert and Rich note the big differences in some of the timings when different
amounts of memory are allocated.

All of our intrepid timers are running on MacPro computers though the
computers have different numbers of cores and different clock speeds.  None of
that matters, this could happen on any modern computer.


Really, really, short answer
----------------------------

Computers are weird.

Or, more accurately, computers are weird when you operate at the boundary of
what will fit in their cache memories.


Long answer
-----------

First, we need to talk a bit about computer architectures.  The processor (or
core) performs very simple tasks, such as add 2 numbers from two of its
internal registers and put them in another register (or more likely in one of
the original registers).

            core

           register 1 +
           register 2 =
           register 3

A simple instruction like this can occur in a single clock cycle and many
modern processors can execute on the order of 3 billion of these per second.
MacPro computers use Intel Xeon processors which can run at speeds up to 3.3
GHz or 3.3 billion instructions per second.  At this point Bill Gould might
draw a hand-crank that could be turned 3 billion times per second, but that is
well beyond my limited ASCII art abilities.

The numbers, unfortunately, are often not in the registers and the computer
must fetch them from memory.  All modern CPUs have a large cache memory from
which they can pull numbers very quickly.  The cache is not the main memory on
the computer -- the memory we allocate in Stata -- it is the processor's own
memory for storing data and instructions that it is using repeatedly.  On the
MacPros that RRK are running each CPU has 8 MB of on-chip cache.

            core                cache

                               +------+
           register 1 +  <---->|      |
           register 2 =  <---->|      |
           register 3    <---->|      |
                               |      |
                                 ...

Getting numbers from the cache to the core registers is extremely fast.  We
will ignore the time required, if only because after quite a bit of searching
I still cannot find the L3 cache latency for the Intel Xeon 3500 processors
and because with modern pipelining the delays between the core and cache truly
are minimal.

If the core cannot find the numbers it wants in cache, then it must pull the
numbers from standard memory into cache.  Standard memory is where Stata
stores your data.

                      incredibly
                         fast         slower
                           |             |
            core           |    cache    |       memory
                           |             |
                           V   +------+  V   +---------------+
           register 1 +  <---->|      |<---->|               |
           register 2 =  <---->|      |<---->|               |
           register 3    <---->|      |<---->|               |
                               |      |      |               |
                                 ...                ...

The problem with pulling something from standard memory is that it is
relatively slow.  The memory on RRK's MacPros runs at 1066 MHz, meaning we can
access just over 1 billion numbers per second, and that is the best we can
hope for.  It takes over 3-times as long to pull a number from standard memory
as it does from cache.  So, if the core must wait for a number from memory, it
will be idle for 2 or 3 cycles.  It is as if your computer is only one-third
as fast while it waits for the cache to be filled with what it needs.

There is one other thing that we need to know.  Single numbers are not moved
from memory to cache, that would be too inefficient, rather a fixed sequential
number of bytes are moved.  This sequence of bytes is called a cache line.
More correctly, a line in cache represents a fixed number of sequential bytes
in memory and there are a fixed number of cache lines.  The last time I looked
seriously at cache management, cache lines were 128 or 256 bytes and that
number had been growing.  Having just now stared pretty hard at the 550 page
Intel® Xeon® Processor C5500/C3500 Series Datasheet I am still not sure what
the size of a cache line is on RRK's computers.  There is some discussion
about 64 bytes for the QPI cache lines, but those are for intercore
communication, though they might also be used for cache management.

I am inclined to think the cache lines are at least 128 bytes.  That is the
smallest sequential chunk of memory that can be moved into cache.  With 8 MB of
cache, that would imply 8*2^20/122 = 65,536 cache lines.  We will need that
number later.

None of this discussion changes for multi-core CPU's.

What does that have to do with RRK's timings?  RRK's dataset is small and
unusual.  It is a single 4-byte variable, y, with 100,000 observations.  It
takes just .4 MB to store y itself.  Stata is going to add twice that amount
for the its own purposes, but that is still only 1.2 MB for the entire
dataset.  We can think of the data as being stored in a rectangular memory
location with 8+4=12 byte records.

           Stata     y
	  +------------+
	  |1235678|1234|
	  |1235678|1234|
	  |1235678|1234|
          |  ...  | ...|

Stata datasets usually are not stored this densely.  Normally, there would be
free space at the end of each record where more variables can be added.

           Stata     y    free space
	  +-----------------------------
	  |1235678|1234|  ...
	  |1235678|1234|
	  |1235678|1234|
          |  ...  | ...|  ...

Moreover, you are likely to have even more free space at the end of each
record if you have allocated more memory to Stata.  This lets Stata add and
drop variables quickly.

So, with 10 MB allocated, RRK's data might look like

           Stata     y    free space
	  +---------------------+
	  |1235678|1234|12345678|
	  |1235678|1234|12345678|                   (1)
	  |1235678|1234|12345678|
          |  ...  | ...|   ...  |

And, with 1000 MB allocated, it might look like

           Stata     y    free space
	  +-------------------------------------
	  |1235678|1234|123456789...  ... ...
	  |1235678|1234|123456789...  ... ...       (2)
	  |1235678|1234|123456789...  ... ...
          |  ...  | ...|   ...      

With the dataset organized as in (1), each record is 20 characters wide,
including free space, and so there is enough room to store all of the data,
including free space in the cache.  With the dataset organized as in (2), that
might not be true.  Since we have 100,000 records and 8 MB of cache, if the
records are wider than 8*2^20/100000 = 83.9 characters, then the entire data
area will not fit into cache.

We could think about it differently and assume that each record is longer than
a cache line.  We would then need 100,000 cache lines to get all of the y
values into cache, but we computed earlier that these CPUs are like to have
only 65,536 cache lines (or maybe half that).  Even if they had more, many
lines would be dedicated to the instructions required to run Stata and many
others for the data and code required by the operating system.

Rich's and Robert's runs with 10 MB allocated produced a data organization
that fit entirely into cache.  Their runs with 500 MB and 1000 MB did not.

This task was generated nearly at random, but it would be difficult to
construct a task better suited to confound the cache management of many modern
computers.  The organization of this dataset (1x100,000) and the fact that the
commands spend little time processing each record is tailor made to expose the
cache issues discussed above.  On my Linux computer with an Intel Core2 Quad
9550 CPU and 6 MB of cache, reducing the observations to 30,000 means that I
have enough cache lines, and I therefore get consistent timings regardless of
how large I set memory.

Also, do not think that you would be better off if Stata kept the data
organized densely.  That would make this one unusual problem sometimes faster
with large memory allocation, but in general would slow Stata down on other
operations far more than this small increase.

To elaborate on my earlier comment that "Computers are weird", it applies
primarily to cases where you are operating on the boundary of what can and
cannot fit into cache.  With Stata, that is particularly true when we are
talking about the dataset fitting into cache.  Most of the time the data for
whatever you are doing in Stata is in standard memory.  One or a few
observations are pulled into cache where substantial computation is done on
each record -- such as forming a cross product or evaluating a likelihood.
When we do more serious computations on each observations (rather than just
summing or counting) the effect of not having the entire dataset in cache is
mitigated.

It is unusual to have a problem where ALL of the data for the task can fit
into cache and you care how long it takes to run your task on this small
dataset.  This might be true if you are bootstrapping or simulating using a
small dataset.  If you have a problem like this, you may be able to speed it
up by tuning the amount of memory allocated to Stata.

    1) Drop all variables and observations that are not part of
       your analysis

    2) Save the resulting dataset

    3) Set memory to the minimum amount to hold your data and
       perform your analysis.  This may require some trial and
       error.

    4) Perform your analysis using the the smaller dataset.

These steps ensure that Stata has no more free space at the end of
observations than it requires for any temporary variables required to perform
your analysis.

You might benefit from this recommendation if you regularly work with datasets
that have only one or two variables and you are doing fast commands (say up to
and including linear regression). Otherwise, this is rarely worth the trouble.

 
-- Vince 
   vwiggins@stata.com


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index