
From: Malcolm Wardlaw <malcolm@mail.utexas.edu>
To: statalist@hsphsun2.harvard.edu
Subject: st: rolling regression calculation speed and large data sets
Date: Thu, 02 Oct 2008 18:48:32 -0500

I have a question that's been really puzzling me. I'm going to try to state it as I think I'm seeing it, but if I'm not even seeing the problem the right way, please tell me. I'm using Stata 9 IC, by the way.

I've written a program that calculates and records the coefficients (and other information) from a rolling regression, around 2 million monthly regressions in all. I'll post the code at the bottom. It's probably a familiar procedure to some of you.

Here's what's weirding me out. If I run this program on the entire dataset, about 2.5 million observations and about 143MB worth of data and overhead, it takes around 0.15 seconds, on average, to perform a single regression. Fine. As you can see, I purposefully timed the other tasks as well, and they add essentially negligible time. The average times remain the same whether I stop the routine at a million or at 2000.

However, if I pare the dataset down to 2000 observations and then run the program on just that set (now around 200kB), the whole program screams and each regression takes around 0.0003 seconds or something equally negligible. So that's weird.

But here's the really strange part. If I rewrite the program as a loop that

1. grabs the data,
2. peels off a few thousand observations and drops the rest,
3. runs the regressions,
4. appends them to a new running data set,
5. saves that set, and then
6. reloads the old set and does the same thing again until I'm done,

the whole program is an order of magnitude faster than if I had kept the data in memory the whole time. This is obviously in spite of all the hard disk work it's now doing.

Can anyone explain to me why this is? My understanding of Stata's memory management and processing control is really fuzzy. Also, does anyone know how Stata's data retention properties work?
It seems clear that if I clear a big dataset and then reload it after messing around with a small dataset, the time it takes to reload the original dataset is tiny. I've never had a good explanation of what Stata is doing and whether it's behavior that should be exploited by programmers.

-------------------------------------------------
capture drop CAPM*
qui gen CAPM_R2=.
qui gen CAPM_alpha=.
qui gen CAPM_beta=.
qui gen CAPM_se=.
qui gen CAPM_N=.

timer clear
local end=_N    /* Note that I have tried this with end=2000 as well */
local cur=1

while `cur' <= `end' {
    timer on 1
    local p2=permno[`cur']
    local p1=permno[`cur'-35]
    if `p1'==`p2' {
        local first=`cur'-35
        timer on 2
        quietly reg ret mkt in `first'/`cur'
        timer off 2
        timer on 3
        /* Write down all that crap */
        qui replace CAPM_R2=`e(r2)' in `cur'
        qui replace CAPM_alpha= _b[_cons] in `cur'
        qui replace CAPM_beta=_b[mkt] in `cur'
        qui replace CAPM_se=_se[mkt] in `cur'
        qui replace CAPM_N=`e(N)' in `cur'
        /* Done, now loop again */
        timer off 3
    }
    local cur=`cur'+1
    timer off 1
}
timer list
-------------------------------------------------
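For reference, the chunked variant described in the post (load a slice, run the regressions, append to a running results file, repeat) might look roughly like the sketch below. It is only an illustration under stated assumptions, not the code that was actually run: the file names crsp_monthly.dta and capm_results.dta, the chunk size of 5000, and the program name capm_roll are all made up here, and the data are assumed to be sorted by permno and then by month, as the original loop requires. Each slice is loaded with 35 extra leading rows so that the first windows of a chunk still see their full 36 observations.

```stata
* Sketch only. crsp_monthly.dta, capm_results.dta, the chunk size and the
* program name capm_roll are hypothetical; they do not come from the post.

capture program drop capm_roll
program define capm_roll
    * Same rolling CAPM regression as in the post, applied to the data in memory
    capture drop CAPM_*
    foreach v in R2 alpha beta se N {
        quietly gen CAPM_`v' = .
    }
    local last = _N
    forvalues cur = 36/`last' {
        * Only regress when rows cur-35 .. cur belong to the same permno
        if permno[`cur'] == permno[`cur' - 35] {
            local first = `cur' - 35
            quietly regress ret mkt in `first'/`cur'
            quietly replace CAPM_R2    = e(r2)     in `cur'
            quietly replace CAPM_alpha = _b[_cons] in `cur'
            quietly replace CAPM_beta  = _b[mkt]   in `cur'
            quietly replace CAPM_se    = _se[mkt]  in `cur'
            quietly replace CAPM_N     = e(N)      in `cur'
        }
    }
end

local chunk = 5000      // rows handled per pass (assumed value)
local lag   = 35        // each window reaches back 35 rows

* Assumes describe using leaves the row count in r(N) in your Stata version
quietly describe using crsp_monthly.dta
local total = r(N)

tempfile results piece
local start = 1
local pass  = 0
while `start' <= `total' {
    local stop = min(`start' + `chunk' - 1, `total')
    local from = max(`start' - `lag', 1)

    * Load this chunk plus the overlap rows its first windows need
    use in `from'/`stop' using crsp_monthly.dta, clear
    capm_roll

    * Drop the overlap rows; the previous pass already produced their results
    local extra = `start' - `from'
    if `extra' > 0 {
        quietly drop in 1/`extra'
    }

    * Append this chunk to the running results file, keeping row order
    quietly save `piece', replace
    if `pass' > 0 {
        use `results', clear
        append using `piece'
    }
    quietly save `results', replace

    local start = `stop' + 1
    local pass  = `pass' + 1
}
save capm_results.dta, replace
```

Because the running results file is re-read and re-saved on every pass, the disk work grows with the number of chunks; a larger chunk size trades that off against whatever per-command cost is slowing down the all-in-memory version.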

Follow-Ups:

- st: Re: rolling regression calculation speed and large data sets
  From: Malcolm Wardlaw <malcolm@mail.utexas.edu>
- Re: st: rolling regression calculation speed and large data sets
  From: "Austin Nichols" <austinnichols@gmail.com>
