
From: "Austin Nichols" <austinnichols@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: rolling regression calculation speed and large data sets
Date: Fri, 3 Oct 2008 03:11:18 -0400

Malcolm Wardlaw <malcolm@mail.utexas.edu>:
The described phenomenon seems odd to me, and worth some further
investigation, but have you considered generating those variables
using lag operators (h tsvarlist) and a by: prefix instead of looping
over obs and running regressions? That approach would have the added
advantage of ensuring you are not tripped up by any missing time
periods, assuming you have -tsset- properly (e.g. if some company's 36
obs in a row are not for 36 consecutive trading days but for 50, say,
because of missing obs).

webuse grunfeld, clear
ren mvalue y
ren kstock x
set type double
g xy=x*y
g xx=x^2
by com: g sumxx=xx+l.xx+l2.xx+l3.xx+l4.xx
by com: g sumxy=xy+l.xy+l2.xy+l3.xy+l4.xy
by com: g sumx=x+l.x+l2.x+l3.x+l4.x
by com: g sumy=y+l.y+l2.y+l3.y+l4.y
g b=(5*sumxy-sumx*sumy)/(5*sumxx-sumx^2)
reg y x in 1/5, nohe
reg y x in 2/6, nohe
l com y x b in 1/6

On Thu, Oct 2, 2008 at 7:48 PM, Malcolm Wardlaw
<malcolm@mail.utexas.edu> wrote:
> Here's what's weirding me out. If I run this program on the entire
> dataset, about 2.5 million observations and about 143MB worth of data
> and overhead, the program takes around .15 seconds or so, on average, to
> perform a single regression. Fine. As you can see, I purposefully
> timed the other tasks as well, and they're essentially negligible in
> terms of time added. The average time results remain the same whether I
> stop the routine at a million or at 2000.
>
> However, if I pare the dataset down to 2000 observations and then run
> the program on just that set (now around 200kB) the whole program
> screams and each regression calculates at around .0003 seconds or
> something equally negligible. So that's weird. But here's the really
> strange part. If I rewrite the program as a loop that 1. grabs the
> data, 2. peels off a few thousand observations and drops the rest, 3.
> runs the regressions, 4. appends them to a new running data set, 5.
> saves that set, and then 6. reloads the old set and does the same thing
> again until I'm done, the whole program is an order of magnitude faster
> than if I had kept the data in memory the whole time. This is obviously
> in spite of all the hard disk work it's now doing.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
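For readers who want to check the arithmetic behind the generated variable b outside Stata, here is a minimal Python sketch. It is an illustration only, not part of the original reply: the function names rolling_slope and ols_slope are hypothetical, and it assumes a single series with no gaps, so it does none of the panel or missing-period handling that lag operators under -tsset- provide. It computes the 5-observation slope from window sums, b = (n*sumxy - sumx*sumy)/(n*sumxx - sumx^2), and compares it with the textbook least-squares slope on the same window.

```python
def rolling_slope(x, y, window=5):
    """Slope of y on x over each trailing window, from window sums.

    Mirrors b = (n*sumxy - sumx*sumy) / (n*sumxx - sumx^2).
    Assumes one gap-free series; the first window-1 entries are None.
    """
    n = window
    out = [None] * len(x)
    for t in range(n - 1, len(x)):
        xs = x[t - n + 1:t + 1]
        ys = y[t - n + 1:t + 1]
        sumx = sum(xs)
        sumy = sum(ys)
        sumxy = sum(a * b for a, b in zip(xs, ys))
        sumxx = sum(a * a for a in xs)
        out[t] = (n * sumxy - sumx * sumy) / (n * sumxx - sumx ** 2)
    return out


def ols_slope(xs, ys):
    """Least-squares slope via deviations from means, for comparison."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den


# Made-up illustrative data (not the grunfeld values).
x = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0]
y = [2.0, 3.5, 9.0, 13.0, 24.0, 30.0, 47.0, 55.0]
b = rolling_slope(x, y, window=5)
```

The two functions should agree up to floating-point error on every full window, which is the same check the reply makes by comparing b with -reg y x in 1/5- and -reg y x in 2/6-.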

References:
st: rolling regression calculation speed and large data sets
From: Malcolm Wardlaw <malcolm@mail.utexas.edu>
