



st: Speed with large panel datasets


From   Gordon Hughes <[email protected]>
To   [email protected]
Subject   st: Speed with large panel datasets
Date   Mon, 21 Mar 2011 10:39:07 +0000

This is partly a comment and partly a query. I have a rather large dataset (more than 500,000 observations) consisting of an unbalanced panel with up to 1,600 observations for each of about 450 panel units. I am carrying out panel-specific ARIMA analyses for some or all panel units in order to estimate the distribution of a set of coefficients across panels.

It turns out that the time required to execute the task varies by at least an order of magnitude depending upon how I set up the analysis. By far the slowest method is to embed the -arima- command in a loop of the following kind:

forval i=1/`npanel' {
    arima depvar indvar1 indvar2 ...  if np==`i' & <some other condition>
}

Execution is faster if I discard all panels that do not satisfy <some other condition> before starting the loop. However, the best method is to (a) discard all data that do not satisfy <some other condition>, (b) reshape the dataset into wide format so that the dependent and independent variables are stored as depvar_`np', indvar1_`np', indvar2_`np', etc., and (c) execute the loop

forval i=1/`npanel' {
    arima depvar_`i' indvar1_`i' indvar2_`i' ...
}
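For concreteness, steps (a) to (c) might look roughly like the sketch below. It is only an illustration under stated assumptions: the time variable t, the indicator keepflag (standing in for <some other condition>), and the ar(1) specification are hypothetical names and choices, not part of the actual analysis, and t is assumed to identify observations uniquely within each panel.

```stata
* Sketch only: t, keepflag and ar(1) are assumed for illustration.
keep if keepflag                      // (a) keepflag stands in for <some other condition>
keep np t depvar indvar1 indvar2      // reshape wide needs only the id, time and stub variables
levelsof np, local(panels)            // panel ids that survive step (a)

* Rename so that reshape produces depvar_1, indvar1_1, indvar2_1, ...
rename depvar  depvar_
rename indvar1 indvar1_
rename indvar2 indvar2_

reshape wide depvar_ indvar1_ indvar2_, i(t) j(np)   // (b) one row per time period
tsset t                                              // single time series per column set

* (c) one -arima- per panel; capture lets the loop continue if a panel fails
foreach i of local panels {
    capture noisily arima depvar_`i' indvar1_`i' indvar2_`i', ar(1)
}
```

Looping over the values returned by -levelsof- rather than over 1/`npanel' avoids referring to variables for panels that were dropped in step (a).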

Even allowing for the time required to reshape a rather large dataset, this is much faster than any alternative that I have tried. It seems that the overhead of processing missing cases in -arima- is very high. The -arima- command has a "savespace" option that is designed to reduce the memory required by constructing and working with a temporary dataset, but the gain in overall execution speed is much less than with the reshape approach.
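A comparison of this kind can be timed with Stata's -timer- command. The sketch below brackets the long-format loop only; it assumes the data are in long format with a hypothetical time variable t, and again uses ar(1) purely as a placeholder specification.

```stata
* Sketch only: t and ar(1) are assumed for illustration.
xtset np t
levelsof np, local(panels)

timer clear 1
timer on 1
foreach i of local panels {
    * capture suppresses output and lets the loop continue past failures
    capture arima depvar indvar1 indvar2 if np == `i', ar(1)
}
timer off 1
timer list 1                          // elapsed seconds for the long-format loop
```

Placing the same bracketing, with a second timer number, around the reshape and the wide-format loop together gives a like-for-like figure that includes the cost of reshaping.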

My query is this: is -arima- unusual in having such a large overhead in processing excluded observations or is the approach of reshaping a large dataset likely to pay off for other estimators when it is necessary to repeat an estimation procedure for a substantial number of panel units?

Gordon Hughes
[email protected]
