Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: Speed with large panel datasets
From: Gordon Hughes <[email protected]>
To: [email protected]
Subject: st: Speed with large panel datasets
Date: Mon, 21 Mar 2011 10:39:07 +0000
This is partly a comment and partly a query. I have a rather large
dataset (more than 500,000 observations) consisting of an unbalanced
panel of up to 1600 observations for each of about 450 panel units. I
am carrying out panel-specific ARIMA analyses for some or all panel
units in order to estimate the distribution of a set of coefficients
across panels.
It turns out that the time required to execute the task varies by at
least an order of magnitude depending upon how I set up the
analysis. By far the slowest method is to embed the -arima- command
in a loop of the following kind:
forval i=1/`npanel' {
arima depvar indvar1 indvar2 ... if np==`i' & <some other condition>
}
The execution is faster if I discard all panels that do not satisfy
<some other condition> before initiating the loop. However, the best
method is (a) discard all data which does not satisfy <some other
condition>, (b) reshape the dataset into wide format so that the
dependent and independent variables are stored as depvar_`np'
indvar1_`np' indvar2_`np' etc, and (c) execute the loop
forval i=1/`npanel' {
arima depvar_`i' indvar1_`i' indvar2_`i' ...
}
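Steps (a) to (c) might be sketched as follows. This is only an illustration under stated assumptions: the time variable t, the indicator keepflag (standing in for <some other condition>), and the ar(1) ma(1) specification are hypothetical names and settings, not part of the analysis described here.

```stata
* Sketch only. Assumed names: np (panel id), t (integer time index),
* keepflag (1 where the extra condition holds), depvar, indvar1, indvar2.
keep if keepflag == 1                  // (a) discard data failing the condition
keep np t depvar indvar1 indvar2       // carry only what the reshape needs

levelsof np, local(panels)             // ids of the panels that remain

* Rename so the wide variables come out as depvar_1, indvar1_1, and so on
rename depvar  depvar_
rename indvar1 indvar1_
rename indvar2 indvar2_

* (b) one row per time period, one set of columns per panel
reshape wide depvar_ indvar1_ indvar2_, i(t) j(np)
tsset t

* (c) one model per panel; -capture noisily- lets the loop continue
* past a panel for which estimation fails (the model is an assumption)
foreach i of local panels {
    capture noisily arima depvar_`i' indvar1_`i' indvar2_`i', ar(1) ma(1)
}
```

Looping over the ids returned by -levelsof- rather than over 1/`npanel' avoids referring to variables for panels that were dropped entirely in step (a).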
Even allowing for the time required to reshape a rather large dataset,
this is much faster than any alternative that I have tried. It seems
that the overhead of processing missing cases in -arima- is very
high. The -arima- command has a "savespace" option, which is designed
to reduce the amount of memory required by the command and which
constructs and works with a temporary dataset, but the gain in overall
execution speed from using it is much smaller than from the reshape
approach.
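A comparison of this kind can be timed with Stata's -timer- command. The sketch below is a hypothetical illustration rather than the code used for the timings reported here: keepflag stands in for <some other condition>, the ar(1) model is assumed, and the long-format data are assumed to be in memory and already -tsset- by np and a time variable.

```stata
* Sketch only. Assumed names: np, keepflag, depvar, indvar1, indvar2.
timer clear
levelsof np, local(panels)

* Timer 1: -if- qualifier applied to the full dataset on every pass
timer on 1
foreach i of local panels {
    capture arima depvar indvar1 indvar2 if np == `i' & keepflag == 1, ar(1)
}
timer off 1

* Timer 2: unwanted observations discarded before the loop starts
preserve
keep if keepflag == 1
timer on 2
foreach i of local panels {
    capture arima depvar indvar1 indvar2 if np == `i', ar(1)
}
timer off 2
restore

timer list                             // elapsed seconds for each timer
```

Replacing -arima- with another estimation command in both loops gives a rough check of whether the same overhead appears for that estimator.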
My query is this: is -arima- unusual in having such a large overhead
in processing excluded observations or is the approach of reshaping a
large dataset likely to pay off for other estimators when it is
necessary to repeat an estimation procedure for a substantial number
of panel units?
Gordon Hughes
[email protected]