Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: vwiggins@stata.com (Vince Wiggins, StataCorp)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Overriding a loop if 0 observations using tabstat
Date: Thu, 29 Apr 2010 12:45:13 -0500
Several people have commented to me, both publicly and privately, about their plans to use my suggested memory compaction steps to boost speed (in rare circumstances). These comments culminated in the suggestion by Robert Picard <picard@netbox.com> that we (StataCorp) consider adding a -compact- command to formalize these steps and obviate the need to set memory.

I must have appeared inadvertently optimistic about the benefits of these steps. What I was trying to say in my posting is: this is a very unusual dataset. The odds of your encountering one like it are truly small. Even if you do encounter one, there is a good chance that compacting your dataset will make no difference. You must also be using the same data over and over again, without using any other data in between.

Why are you so unlikely to have Stata data that confounds your computer's cache architecture? First, all of the data, including any temporary variables that Stata will create on your behalf, must fit in the cache on one CPU -- typically a maximum of 8 MB, but more likely 2 MB or 4 MB, and maybe as little as 0.5 MB on your computer. Second, you must have more observations than available cache lines on your CPU -- most CPUs will have at least 30,000 available cache lines. So, you need a dataset that is small enough to fit in your cache (leaving room for Stata and operating system code) and yet has more than 30,000 observations.

You must also use that data over and over again without doing anything else in between. Why nothing in between? Because that would flush the cache and you would have to repopulate it from slower "standard" memory.

All of this happens, but not often. Most often it would happen with a simulation or bootstrap, where there would be no intervening computations. These same simulations and bootstraps often use estimation commands that will create temporary variables on your behalf -- a maximum likelihood estimator will generally create between 4 and 14 double-precision (8-byte) variables.
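To make the two conditions above concrete, here is a back-of-envelope sketch (in Python, purely illustrative). The 2 MB cache size and 64-byte cache line are assumed figures, not taken from the post; note that 2 MB / 64 B gives about 32,768 lines, consistent with the "at least 30,000 cache lines" quoted above.

```python
# Back-of-envelope check of the two unusual conditions described above.
# Assumed figures (not from the post): 2 MB of usable cache, 64-byte lines.
CACHE_BYTES = 2 * 1024 * 1024   # assumed 2 MB cache per CPU
LINE_BYTES = 64                 # assumed 64-byte cache lines

def fits_and_spans(n_obs, row_bytes, temp_doubles=0):
    """Return (fits_in_cache, more_obs_than_lines) for a dataset.

    temp_doubles: temporary 8-byte variables an estimator may add
    (the post mentions 4 to 14 for a maximum-likelihood estimator).
    """
    total_bytes = n_obs * (row_bytes + 8 * temp_doubles)
    cache_lines = CACHE_BYTES // LINE_BYTES   # 32,768 lines here
    return total_bytes <= CACHE_BYTES, n_obs > cache_lines

# 40,000 observations of one 4-byte float plus 4 temporary doubles:
# 40,000 * (4 + 32) = 1,440,000 bytes, which fits in 2 MB, and
# 40,000 > 32,768 lines -- so both rare conditions hold at once.
print(fits_and_spans(40_000, 4, temp_doubles=4))  # (True, True)
```

Most real datasets fail the first check (too big for cache) or the second (too few observations), which is why the situation the post describes is so rare.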
The larger number of variables is created when scores are computed for robust or cluster-robust SEs. Even if your data and task fit all of these characteristics, you may still not need to compact; caching will work just fine for a wide range of memory settings and associated data organizations.

Having said all of that, we StataCorpans have discussed in the past things not unlike Robert's -compact- command. The problem is that rearranging the data is not free, and that cost is almost guaranteed if you type -compact-, whereas you are very unlikely to see any benefit -- and that likelihood will vary across computer architectures. If we implemented -compact-, it would be the first command in the manual whose description would begin, "Do not use this command."

-- Vince
vwiggins@stata.com

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/