


Re: st: Overriding a loop if 0 observations using tabstat

From (Vince Wiggins, StataCorp)
Subject   Re: st: Overriding a loop if 0 observations using tabstat
Date   Thu, 29 Apr 2010 12:45:13 -0500

Several people have commented to me, both publicly and privately, about their
plans to use my suggested memory compaction steps to boost speed (in rare
circumstances).  These comments culminated in the suggestion by Robert Picard
that we (StataCorp) consider adding a -compact- command to
formalize these steps and obviate the need to set memory.

I must have appeared inadvertently optimistic about the benefits of these
steps.  What I was trying to say in my posting is, 

   This is a very unusual dataset. 
   The odds of you encountering one like it are truly small.

   Even if you encounter one, there is a good chance that compacting
        your dataset will make no difference.

   You must be using the same data over and over again, without using
        any other data in between.

Why are you so unlikely to have Stata data that confounds your computer's
cache architecture?  First, all of the data, including any temporary
variables that Stata will create on your behalf, must fit in the cache on one
CPU -- typically a maximum of 8 MB, but more likely 2 MB or 4 MB and maybe as
little as 0.5 MB on your computer.  Second, you must have more observations
than available cache lines on your CPU -- most CPUs will have at least 30,000
available cache lines.  So, you need a dataset that is small enough to fit in
your cache (leaving room for Stata and operating system code) and that has more
than 30,000 observations.
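
As a rough illustration of the two conditions above, here is a minimal Python
sketch.  The function name and parameters are hypothetical, and the check
ignores the room that Stata and operating system code take up in the cache, so
treat it as back-of-the-envelope arithmetic rather than a real diagnostic.

```python
def cache_conditions_met(n_obs, bytes_per_obs, cache_bytes, cache_lines=30_000):
    """Return True only if both conditions described in the text hold.

    n_obs         -- number of observations in the dataset
    bytes_per_obs -- width of one observation, temporary variables included
    cache_bytes   -- cache size on one CPU (assumed fully available here)
    cache_lines   -- available cache lines (30,000 is the floor quoted above)
    """
    fits_in_cache = n_obs * bytes_per_obs <= cache_bytes
    more_obs_than_lines = n_obs > cache_lines
    return fits_in_cache and more_obs_than_lines


MB = 1024 * 1024
print(cache_conditions_met(50_000, 40, 4 * MB))   # small and long enough
print(cache_conditions_met(20_000, 40, 4 * MB))   # too few observations
print(cache_conditions_met(200_000, 40, 4 * MB))  # does not fit in cache
```

Only the first call satisfies both conditions; the other two fail one
condition each.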

You must also use that data over and over again without doing anything else in
between.  Why nothing in between?  Because intervening work would flush the
cache and you would have to repopulate it from slower "standard" memory.

All of this happens, but not often.

Most often it would happen with a simulation or bootstrap, where there would
be no intervening computations.  These same simulations and bootstraps often
use estimation commands that will create temporary variables on your behalf --
a maximum likelihood estimator will generally create between 4 and 14
double-precision (8-byte) variables.  The larger number applies when scores
are computed for robust or cluster-robust SEs.
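
To get a feel for how much those temporary variables eat into the cache, the
arithmetic can be sketched in Python.  The 40-byte data row and the helper
names are assumptions made up for this illustration; only the 4-to-14 range
and the 8-byte width come from the text.

```python
DOUBLE_BYTES = 8  # width of one double-precision temporary variable


def temp_bytes_per_obs(n_temp_vars):
    """Bytes per observation added by the estimator's temporary variables."""
    return n_temp_vars * DOUBLE_BYTES


def max_obs_in_cache(cache_bytes, data_bytes_per_obs, n_temp_vars):
    """Largest observation count whose data plus temporaries fit in the cache.

    Assumes the whole cache is available, which overstates the real limit.
    """
    row_bytes = data_bytes_per_obs + temp_bytes_per_obs(n_temp_vars)
    return cache_bytes // row_bytes


MB = 1024 * 1024
for cache_mb in (2, 8):
    for n_temp in (4, 14):
        limit = max_obs_in_cache(cache_mb * MB, 40, n_temp)
        print(f"{cache_mb} MB cache, {n_temp} temporaries: {limit} obs fit")
```

Under these assumptions a 2 MB cache cannot hold even 30,000 observations with
either 4 or 14 temporaries, so such a dataset could not meet both conditions
at once.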

Even if your data and task fit all of these characteristics, you may still not
need to compact; caching will work just fine for a wide range of memory
settings and associated data organizations.

Having said all of that, we StataCorpans have discussed in the past things not
unlike Robert's -compact- command.  The problem is that rearranging the data
is not free and that cost is almost guaranteed if you type -compact-, whereas
you are very unlikely to see any benefit and that likelihood will vary across
computer architectures.  If we implemented -compact- it would be the first
command in the manual whose description would begin, "Do not use this

-- Vince
