



Re: st: Overriding a loop if 0 observations using tabstat


From   Richard Goldstein <[email protected]>
To   [email protected]
Subject   Re: st: Overriding a loop if 0 observations using tabstat
Date   Thu, 29 Apr 2010 08:32:39 -0400

Vince,

thank you very much for this detailed explanation; since I do a lot of
work on files of very different sizes where almost all the work is data
management, it appears that I might be able to save some time on some of
these multi-hour runs.

Rich

On 4/28/10 5:27 PM, Vince Wiggins, StataCorp wrote:
> Short question
> --------------
> 
> In a thread that started out about looping then became about relative speed
> with differing amounts of memory, Richard Goldstein <[email protected]>,
> Robert Picard <[email protected]>, and Kit Baum <[email protected]> have written
> about substantial differences in runtimes when performing many iterations of
> some very simple computations on a very small dataset.  They attribute the
> difference in runtimes to differing amounts of memory allocated to Stata.
> 
> 
> Short answer
> ------------
> 
> First, I believe their timings.  Second, I am not concerned.
> 
> There is nothing to worry about here.  These timings are an artifact of a
> toy-size example combined with the way modern computers handle cache memory.
> You will see timing differences of this sort only on small datasets and when
> doing easy computations that are performed repeatedly -- Robert, Rich,
> and Kit (RRK) are computing means and counts over a 100,000-observation
> dataset consisting of a single variable.  The only reason they observe much
> time passing at all is that they are iterating the task 10,000 times.
> 
> 
> Long question
> -------------
> 
> The task that RRK are timing came from Nick Cox <[email protected]> 
> (who at the time was merely trying to show that -count- was faster than
> -summarize, meanonly- and that both were faster than -summarize-.)  Here is
> the do file.
> 
>     -------------------------------- BEGIN --- timings.do --- CUT HERE -------
>     clear
>     set obs 100000
>     set seed 2803
>     gen y = runiform()
>     set rmsg on
>     
>     qui forval i = 1/10000 {
>             count if y > 0.5
>     }
>     
>     qui forval i = 1/10000 {
>             summarize y if y > 0.5, meanonly
>     }
> 
>     qui forval i = 1/10000 {
>             summarize y if y > 0.5
>     }
>     --------------------------------   END --- timings.do --- CUT HERE -------
> 
> RRK are timing each of the loops -- the count loop, the summarize meanonly
> loop, and the full summarize loop.  They are performing the timings with
> different amounts of memory allocated to Stata.
> 
> Here are some of the timings reported by RRK (rounded to seconds).
> 
>                                summarize
>             memory    count    meanonly   summarize
>     ---------------+-------------------------------
>     Rich     10 MB |    6         50          56
>     Rich    500 MB |    6         99         104
>                    |
>     Robert   10 MB |   11         43          47
>     Robert 1000 MB |   17         64          71
>                    |
>     Kit     500 MB |   14         61          72
> 
>     Kit did not report timings for other memory configurations, but
>     surprisingly noted that reducing the memory allocated to Stata increased
>     the runtimes -- the opposite of what Rich and Robert observed.
> 
> Robert and Rich note the big differences in some of the timings when different
> amounts of memory are allocated.
> 
> All of our intrepid timers are running on MacPro computers, though the
> computers have different numbers of cores and different clock speeds.  None of
> that matters; this could happen on any modern computer.
> 
> 
> Really, really, short answer
> ----------------------------
> 
> Computers are weird.
> 
> Or, more accurately, computers are weird when you operate at the boundary of
> what will fit in their cache memories.
> 
> 
> Long answer
> -----------
> 
> First, we need to talk a bit about computer architectures.  The processor (or
> core) performs very simple tasks, such as add 2 numbers from two of its
> internal registers and put them in another register (or more likely in one of
> the original registers).
> 
>             core
> 
>            register 1 +
>            register 2 =
>            register 3
> 
> A simple instruction like this can occur in a single clock cycle and many
> modern processors can execute on the order of 3 billion of these per second.
> MacPro computers use Intel Xeon processors which can run at speeds up to 3.3
> GHz or 3.3 billion instructions per second.  At this point Bill Gould might
> draw a hand-crank that could be turned 3 billion times per second, but that is
> well beyond my limited ASCII art abilities.
> 
> The numbers, unfortunately, are often not in the registers and the computer
> must fetch them from memory.  All modern CPUs have a large cache memory from
> which they can pull numbers very quickly.  The cache is not the main memory on
> the computer -- the memory we allocate in Stata -- it is the processor's own
> memory for storing data and instructions that it is using repeatedly.  On the
> MacPros that RRK are running, each CPU has 8 MB of on-chip cache.
> 
>             core                cache
> 
>                                +------+
>            register 1 +  <---->|      |
>            register 2 =  <---->|      |
>            register 3    <---->|      |
>                                |      |
>                                  ...
> 
> Getting numbers from the cache to the core registers is extremely fast.  We
> will ignore the time required, if only because after quite a bit of searching
> I still cannot find the L3 cache latency for the Intel Xeon 3500 processors
> and because with modern pipelining the delays between the core and cache truly
> are minimal.
> 
> If the core cannot find the numbers it wants in cache, then it must pull the
> numbers from standard memory into cache.  Standard memory is where Stata
> stores your data.
> 
>                       incredibly
>                          fast         slower
>                            |             |
>             core           |    cache    |       memory
>                            |             |
>                            V   +------+  V   +---------------+
>            register 1 +  <---->|      |<---->|               |
>            register 2 =  <---->|      |<---->|               |
>            register 3    <---->|      |<---->|               |
>                                |      |      |               |
>                                  ...                ...
> 
> The problem with pulling something from standard memory is that it is
> relatively slow.  The memory on RRK's MacPros runs at 1066 MHz, meaning we can
> access just over 1 billion numbers per second, and that is the best we can
> hope for.  It takes over three times as long to pull a number from standard
> memory as it does from cache.  So, if the core must wait for a number from
> memory, it will be idle for 2 or 3 cycles.  It is as if your computer is only
> one-third as fast while it waits for the cache to be filled with what it needs.
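> As a rough sanity check on that factor (a Python sketch using only the clock
> rates quoted above; real memory latency involves much more than clock ratios):

```python
# Rough clock-ratio sketch: roughly how many core cycles pass while
# one number is fetched from standard memory, using the figures above.
core_clock_hz = 3.3e9       # Xeon core clock quoted in the post
memory_clock_hz = 1.066e9   # 1066 MHz standard memory

cycles_per_fetch = core_clock_hz / memory_clock_hz
print(f"{cycles_per_fetch:.1f} core cycles per memory access")  # 3.1
```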
> 
> There is one other thing that we need to know.  Single numbers are not moved
> from memory to cache, that would be too inefficient, rather a fixed sequential
> number of bytes are moved.  This sequence of bytes is called a cache line.
> More correctly, a line in cache represents a fixed number of sequential bytes
> in memory and there are a fixed number of cache lines.  The last time I looked
> seriously at cache management, cache lines were 128 or 256 bytes and that
> number had been growing.  Having just now stared pretty hard at the 550 page
> Intel® Xeon® Processor C5500/C3500 Series Datasheet I am still not sure what
> the size of a cache line is on RRK's computers.  There is some discussion
> about 64 bytes for the QPI cache lines, but those are for intercore
> communication, though they might also be used for cache management.
> 
> I am inclined to think the cache lines are at least 128 bytes.  That is the
> smallest sequential chunk of memory that can be moved into cache.  With 8 MB of
> cache, that would imply 8*2^20/128 = 65,536 cache lines.  We will need that
> number later.
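> That arithmetic is easy to reproduce (a Python sketch; the 128-byte line
> size is my assumption, as noted above):

```python
cache_bytes = 8 * 2**20      # 8 MB of on-chip cache
line_bytes = 128             # assumed cache-line size
num_lines = cache_bytes // line_bytes
print(num_lines)             # 65536 cache lines

# With 256-byte lines there would be half as many:
print(cache_bytes // 256)    # 32768
```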
> 
> None of this discussion changes for multi-core CPUs.
> 
> What does that have to do with RRK's timings?  RRK's dataset is small and
> unusual.  It is a single 4-byte variable, y, with 100,000 observations.  It
> takes just .4 MB to store y itself.  Stata is going to add twice that amount
> for its own purposes, but that is still only 1.2 MB for the entire
> dataset.  We can think of the data as being stored in a rectangular memory
> location with 8+4=12 byte records.
> 
>            Stata       y
>           +-------------+
>           |12345678|1234|
>           |12345678|1234|
>           |12345678|1234|
>           |   ...  | ...|
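> As a cross-check on those sizes (a Python sketch; it uses decimal megabytes,
> which matches the round numbers above):

```python
obs = 100_000
y_bytes = 4          # one 4-byte variable, y
overhead_bytes = 8   # Stata's per-record bookkeeping, per the post

y_mb = obs * y_bytes / 1e6
total_mb = obs * (y_bytes + overhead_bytes) / 1e6
print(f"y alone: {y_mb:.1f} MB, whole dataset: {total_mb:.1f} MB")
# y alone: 0.4 MB, whole dataset: 1.2 MB
```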
> 
> Stata datasets usually are not stored this densely.  Normally, there would be
> free space at the end of each record where more variables can be added.
> 
>            Stata       y    free space
>           +----------------------------
>           |12345678|1234|  ...
>           |12345678|1234|
>           |12345678|1234|
>           |   ...  | ...|  ...
> 
> Moreover, you are likely to have even more free space at the end of each
> record if you have allocated more memory to Stata.  This lets Stata add and
> drop variables quickly.
> 
> So, with 10 MB allocated, RRK's data might look like
> 
>            Stata       y    free space
>           +----------------------+
>           |12345678|1234|12345678|
>           |12345678|1234|12345678|                   (1)
>           |12345678|1234|12345678|
>           |   ...  | ...|   ...  |
> 
> And, with 1000 MB allocated, it might look like
> 
>            Stata       y    free space
>           +---------------------------------------
>           |12345678|1234|123456789...  ... ...
>           |12345678|1234|123456789...  ... ...       (2)
>           |12345678|1234|123456789...  ... ...
>           |   ...  | ...|    ...
> 
> With the dataset organized as in (1), each record is 20 characters wide,
> including free space, and so there is enough room to store all of the data,
> including free space in the cache.  With the dataset organized as in (2), that
> might not be true.  Since we have 100,000 records and 8 MB of cache, if the
> records are wider than 8*2^20/100000 = 83.9 characters, then the entire data
> area will not fit into cache.
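> That threshold is just the division above (a Python sketch of the same
> arithmetic):

```python
cache_bytes = 8 * 2**20   # 8 MB of on-chip cache
obs = 100_000

max_record_bytes = cache_bytes / obs
print(f"{max_record_bytes:.1f}")  # 83.9 -- wider records overflow the cache
```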
> 
> We could think about it differently and assume that each record is longer than
> a cache line.  We would then need 100,000 cache lines to get all of the y
> values into cache, but we computed earlier that these CPUs are likely to have
> only 65,536 cache lines (or maybe half that).  Even if they had more, many
> lines would be dedicated to the instructions required to run Stata and many
> others for the data and code required by the operating system.
> 
> Rich's and Robert's runs with 10 MB allocated produced a data organization
> that fit entirely into cache.  Their runs with 500 MB and 1000 MB did not.
> 
> This task was generated nearly at random, but it would be difficult to
> construct a task better suited to confound the cache management of many modern
> computers.  The organization of this dataset (1 x 100,000) and the fact that the
> commands spend little time processing each record are tailor-made to expose the
> cache issues discussed above.  On my Linux computer with an Intel Core2 Quad
> 9550 CPU and 6 MB of cache, reducing the observations to 30,000 means that I
> have enough cache lines, and I therefore get consistent timings regardless of
> how large I set memory.
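> That 30,000 figure squares with the cache-line arithmetic (a Python sketch;
> the 128-byte line size and one-line-per-record assumption are mine, and in
> practice the operating system and Stata's own code also occupy lines):

```python
cache_bytes = 6 * 2**20       # 6 MB cache on the Core2 Quad 9550
line_bytes = 128              # assumed cache-line size
num_lines = cache_bytes // line_bytes
print(num_lines)              # 49152 lines available at most

# At one line per record, 30,000 observations fit; 100,000 do not.
print(30_000 <= num_lines)    # True
print(100_000 <= num_lines)   # False
```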
> 
> Also, do not think that you would be better off if Stata kept the data
> organized densely.  That would make this one unusual problem sometimes faster
> with large memory allocation, but in general would slow Stata down on other
> operations far more than this small increase.
> 
> To elaborate on my earlier comment that "Computers are weird", it applies
> primarily to cases where you are operating on the boundary of what can and
> cannot fit into cache.  With Stata, that is particularly true when we are
> talking about the dataset fitting into cache.  Most of the time the data for
> whatever you are doing in Stata is in standard memory.  One or a few
> observations are pulled into cache where substantial computation is done on
> each record -- such as forming a cross product or evaluating a likelihood.
> When we do more serious computations on each observation (rather than just
> summing or counting) the effect of not having the entire dataset in cache is
> mitigated.
> 
> It is unusual to have a problem where ALL of the data for the task can fit
> into cache and you care how long it takes to run your task on this small
> dataset.  This might be true if you are bootstrapping or simulating using a
> small dataset.  If you have a problem like this, you may be able to speed it
> up by tuning the amount of memory allocated to Stata.
> 
>     1) Drop all variables and observations that are not part of
>        your analysis
> 
>     2) Save the resulting dataset
> 
>     3) Set memory to the minimum amount to hold your data and
>        perform your analysis.  This may require some trial and
>        error.
> 
>     4) Perform your analysis using the smaller dataset.
> 
> These steps ensure that Stata has no more free space at the end of
> observations than it requires for any temporary variables required to perform
> your analysis.
> 
> You might benefit from this recommendation if you regularly work with datasets
> that have only one or two variables and you are doing fast commands (say up to
> and including linear regression). Otherwise, this is rarely worth the trouble.
> 
>  
> -- Vince 
>    [email protected]
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

