Re: st: rolling regression calculation speed and large data sets

From   "Austin Nichols" <[email protected]>
To   [email protected]
Subject   Re: st: rolling regression calculation speed and large data sets
Date   Fri, 3 Oct 2008 03:11:18 -0400

Malcolm Wardlaw <[email protected]>:
The described phenomenon seems odd to me, and worth some further
investigation, but have you considered generating those variables
using lag operators (see -help tsvarlist-) and a by: prefix instead of looping
over obs and running regressions?  That approach would have the added
advantage of ensuring you are not tripped up by any missing time
periods, assuming you have -tsset- the data properly (e.g. if some
company's 36 obs in a row cover not 36 consecutive trading days but 50,
say, because of missing obs).
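
For reference, the slope (and intercept) from regressing y on x over n
observations can be written purely in terms of sums, which is why rolling
sums built from lags are enough. This is the standard textbook identity,
with n=5 in the example that follows; the intercept a is not computed
there and is shown only for completeness:

```latex
b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
a = \frac{\sum y_i - b\sum x_i}{n}
```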

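* example panel data set
webuse grunfeld, clear
ren mvalue y
ren kstock x
* create the new variables below in double precision
set type double
g xy=x*y
g xx=x^2
* 5-period rolling sums; missing whenever any of the 4 lags is missing
by com: g sumxx=xx+l.xx+l2.xx+l3.xx+l4.xx
by com: g sumxy=xy+l.xy+l2.xy+l3.xy+l4.xy
by com: g sumx=x+l.x+l2.x+l3.x+l4.x
by com: g sumy=y+l.y+l2.y+l3.y+l4.y
* OLS slope of y on x over each 5-period window
g b=(5*sumxy-sumx*sumy)/(5*sumxx-sumx^2)
* check: b in obs 5 and 6 should match the slopes from these regressions
reg y x in 1/5, nohe
reg y x in 2/6, nohe
l com y x b in 1/6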
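
For a longer window such as your 36 obs, typing out every lag gets
tedious, so here is a rough sketch of the same calculation with the
window width in a local macro. The local w and the intercept a are
illustrative additions of mine, not part of the example above, and the
sketch relies on the grunfeld data arriving already -tsset-, as that
example does:

```stata
* Sketch only: same data and variable names as the example above;
* the local w and the intercept a are illustrative additions.
webuse grunfeld, clear
ren mvalue y
ren kstock x
set type double
g xy=x*y
g xx=x^2
local w 5
foreach v in x y xy xx {
    * start from the current value, then add lags 1 to w-1;
    * any missing lag makes the whole sum missing
    g sum`v'=`v'
    forvalues k=1/`=`w'-1' {
        qui replace sum`v'=sum`v'+l`k'.`v'
    }
}
* slope and intercept from the rolling sums
g b=(`w'*sumxy-sumx*sumy)/(`w'*sumxx-sumx^2)
g a=(sumy-b*sumx)/`w'
* check against one window estimated directly
reg y x in 1/5, nohe
l com y x a b in 1/6
```

Setting w to 36 should give the 36-period version; the lag operators
work within panels, so windows with gaps or too few earlier periods
come out missing rather than wrong.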

On Thu, Oct 2, 2008 at 7:48 PM, Malcolm Wardlaw <[email protected]> wrote:
> Here's what's weirding me out.  If I run this program on the entire
> dataset, about 2.5 million observations and about 143MB worth of data
> and overhead, the program takes around .15 seconds or so, on average, to
> perform a single regression.  Fine.  As you can see, I purposefully
> timed the other tasks as well, and they're essentially negligible in
> terms of time added.  The average time results remain the same whether I
> stop the routine at a million or at 2000.
> However, if I pare the dataset down to 2000 observations and then run
> the program on just that set (now around 200kB) the whole program
> screams and each regression calculates at around .0003 seconds or
> something equally negligible.  So that's weird.  But here's the really
> strange part.  If I rewrite the program as a loop that 1. grabs the
> data, 2. peels off a few thousand observations and drops the rest, 3.
> runs the regressions, 4. appends them to a new running data set, 5.
> saves that set, and then 6. reloads the old set and does the same thing
> again until I'm done, the whole program is an order of magnitude faster
> than if I had kept the data in memory the whole time.  This is obviously
> in spite of all the hard disk work it's now doing.