Statalist



st: rolling regression calculation speed and large data sets


From   Malcolm Wardlaw <malcolm@mail.utexas.edu>
To   statalist@hsphsun2.harvard.edu
Subject   st: rolling regression calculation speed and large data sets
Date   Thu, 02 Oct 2008 18:48:32 -0500

I have a question that's been really puzzling me.  I'm going to try and
state it as I think I'm seeing it, but if I'm not even seeing the
problem the right way, please tell me.  I'm using Stata 9 IC by the way.

I've written a program that will calculate and record the coefficients
(and other information) on a rolling regression of around 2 million
monthly regressions.  I'll post the code at the bottom.  It's probably a
familiar procedure to some of you.

Here's what's weirding me out.  If I run this program on the entire
dataset, about 2.5 million observations and about 143MB worth of data
and overhead, the program takes around 0.15 seconds or so, on average, to
perform a single regression.  Fine.  As you can see, I purposefully
timed the other tasks as well, and they're essentially negligible in
terms of time added.  The average time results remain the same whether I
stop the routine at a million or at 2000.

However, if I pare the dataset down to 2000 observations and then run
the program on just that set (now around 200 kB), the whole program
screams and each regression calculates in around 0.0003 seconds or
something equally negligible.  So that's weird.  But here's the really
strange part.  If I rewrite the program as a loop that 1. grabs the
data, 2. peels off a few thousand observations and drops the rest, 3.
runs the regressions, 4. appends them to a new running data set, 5.
saves that set, and then 6. reloads the old set and does the same thing
again until I'm done, the whole program is an order of magnitude faster
than if I had kept the data in memory the whole time.  This is obviously
in spite of all the hard disk work it's now doing.
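
For concreteness, here is a minimal sketch of that six-step loop, not the exact code I ran.  Everything named in it is made up for illustration: the toy data (4 securities, 100 months each), the chunk size of 150, the tempfiles, and the helper program roll_capm, which is a cut-down version of the loop posted at the bottom that only records the beta.  Each chunk keeps 35 rows of lead-in so that the first windows in the chunk are complete.

```stata
version 9
* Toy data standing in for the real file (hypothetical: 4 permnos x 100 months)
clear
set obs 400
set seed 12345
gen permno = 1 + int((_n-1)/100)
gen mkt = invnorm(uniform())
gen ret = 0.01 + 1.2*mkt + 0.05*invnorm(uniform())
tempfile big results piece
quietly save `big'

* Hypothetical helper: rolling 36-row regression, beta only
capture program drop roll_capm
program define roll_capm
    capture drop CAPM_beta
    quietly gen CAPM_beta = .
    local cur = 36
    while `cur' <= _N {
        local first = `cur' - 35
        if permno[`first'] == permno[`cur'] {
            quietly regress ret mkt in `first'/`cur'
            quietly replace CAPM_beta = _b[mkt] in `cur'
        }
        local cur = `cur' + 1
    }
end

local chunk = 150                 /* assumed chunk size */
use `big', clear
local total = _N
local start = 1
while `start' <= `total' {
    local stop = min(`start' + `chunk' - 1, `total')
    local from = max(`start' - 35, 1)      /* 35 rows of lead-in */
    use `big', clear                       /* 1. grab the data */
    quietly keep in `from'/`stop'          /* 2. peel off a chunk, drop the rest */
    roll_capm                              /* 3. run the regressions */
    local lead = `start' - `from'
    if `lead' > 0 {
        quietly drop in 1/`lead'           /* lead-in rows belong to the previous chunk */
    }
    if `start' > 1 {
        quietly save `piece', replace
        use `results', clear
        append using `piece'               /* 4. append to the running set */
    }
    quietly save `results', replace        /* 5. save it */
    local start = `stop' + 1               /* 6. reload and repeat */
}
use `results', clear
```

At the end, the data in memory has all the original rows in their original order, with CAPM_beta filled in wherever a full 36-row window within one permno exists.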

Can anyone explain to me why this is?  My understanding of Stata's
memory management and processing control is really fuzzy.

Also, does anyone know how Stata's data retention properties work?  It
seems clear that if I clear a big dataset and then reload it after
messing around with a small dataset, the time it takes to reload the
original dataset is tiny.  I've never had a good explanation of what
Stata is doing and whether it's behavior that should be exploited by
programmers.
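
In case anyone wants to reproduce the reload timing, this is roughly how I would measure it.  It is only a sketch: the toy file and its size are made up, and because the file has just been written, even the first load here may not be a truly cold read, so a fair comparison probably needs a fresh session with the real file.

```stata
version 9
* Hypothetical stand-in for the big dataset
clear
set obs 10000
gen x = uniform()
tempfile big
quietly save `big'

timer clear
clear
timer on 1
use `big', clear          /* first load */
timer off 1

* Mess around with a small dataset in between
drop _all
set obs 10
gen y = _n

timer on 2
use `big', clear          /* reload after working on the small set */
timer off 2
timer list
```

Timer 1 is the first load and timer 2 is the reload, so the two lines from timer list can be compared directly.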

-------------------------------------------------
capture drop CAPM*
qui gen CAPM_R2=.
qui gen CAPM_alpha=.
qui gen CAPM_beta=.
qui gen CAPM_se=.
qui gen CAPM_N=.
timer clear
local end=_N     /* Note that I have tried this with end=2000 as well */
local cur=1
while `cur' <= `end' {
    timer on 1
    /* permno[] is missing for subscripts below 1, so the first 35 rows never match */
    local p2=permno[`cur']
    local p1=permno[`cur'-35]
    /* Regress only if the row 35 back has the same permno, i.e. a full
       36-row window (assumes the data are sorted by permno and date) */
    if `p1'==`p2' {
        local first=`cur'-35
        timer on 2
        quietly reg ret mkt in `first'/`cur'
        timer off 2
        timer on 3
        /*  Write down all that crap */
        qui replace CAPM_R2=`e(r2)' in `cur'
        qui replace CAPM_alpha= _b[_cons] in `cur'
        qui replace CAPM_beta=_b[mkt] in `cur'
        qui replace CAPM_se=_se[mkt] in `cur'
        qui replace CAPM_N=`e(N)'  in `cur'
        /* Done, now loop again */
        timer off 3
    }
    local cur=`cur'+1
    timer off 1
}
timer list
-------------------------------------------------


