
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at


st: Efficient way to predict values from regressions on subsets of the data?

Subject   st: Efficient way to predict values from regressions on subsets of the data?
Date   Fri, 15 Apr 2011 17:35:23 -0400

Hello all,

I have a project that involves assembling a panel of data in long format 
and running (quantile) regressions for each institution.  My basic problem 
involves running estimations on subsets of the data and keeping predicted 
values  from each of the regressions.  I can't use -by:- unless I write a 
wrapper, but this will be slow anyway because it uses if qualifiers (see 
below).   I have implemented this in both SAS and Stata and my SAS code is 
about 100 times faster than my best Stata implementation.
The panel is unbalanced, but to give you an idea the average number of 
time periods is 650 and the number of firms is over a thousand.   For each 
firm I need to run three regressions, taking predicted values from two and 
a coefficient from the third, and combining these three items into a new 
variable.  I have been having trouble finding a way to do this efficiently.

One way would be to loop over all firms and use if qualifiers in the 
regressions and predictions.  I have found this to be very slow on such a 
long dataset: the procedure seems to take around 4 to 40 seconds per firm!
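
For concreteness, a minimal sketch of the kind of loop described above. 
The variable names firm, y, x, and yhat are hypothetical, and only one of 
the three regressions is shown.  Each if qualifier is evaluated against 
every observation in the long dataset, which is the likely source of the 
cost.

```stata
* Sketch only: firm (numeric id), y, and x are hypothetical variable names
generate double yhat = .
levelsof firm, local(firms)
foreach f of local firms {
    * each -if- below is checked against every row of the panel
    quietly qreg y x if firm == `f'
    tempvar p
    quietly predict double `p' if firm == `f'
    quietly replace yhat = `p' if firm == `f'
    drop `p'
}
```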

My current code is a bit cumbersome but faster; it reshapes the data into 
wide format to avoid using if qualifiers.  I split the data into 10 pieces 
by firm, then reshape each of these 10 pieces into wide format.  I split 
into 10 files because Stata's reshape command is quite slow (25-30 minutes 
for me) in reshaping my whole panel from long to wide, whereas each of the 
10 pieces takes only a few seconds to reshape.  Then I have 2 layers of 
loops: one over the 10 files and one over the firms inside each file, 
running the estimation and generating new variables for each firm's 
results.  This method is much faster because, with the data in wide 
format, no if qualifiers are needed.  It takes about 0.5-1.2 seconds to 
run each firm.  Overall, including the reshaping, this procedure takes 
maybe 20-30 minutes to run.

Unfortunately for Stata fans (including myself), I was able to get this 
entire thing to run in about 50 seconds in SAS, or about 0.04 seconds per 
firm!  The trick is that SAS can automatically run quantile regressions 
-by- a panel variable AND output predicted values at the same time.  But, 
I would like to keep everything in Stata if I can.  Does anyone have a 
suggestion on a more efficient method of implementing what I am doing?  
Would using the -in- qualifier instead of -if- be worth it?
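
To make the -in- idea concrete, here is a sketch of what that variant 
could look like.  It assumes the data are sorted so that each firm 
occupies a contiguous block of rows; the variable names firm, date, y, x, 
obsno, first, last, and yhat are hypothetical, and only one regression is 
shown.  Whether it is actually faster than -if- would need to be timed.

```stata
* Sketch only: all variable names here are hypothetical
sort firm date
generate long obsno = _n
by firm: generate long first = obsno[1]     // first row of the firm's block
by firm: generate long last  = obsno[_N]    // last row of the firm's block
generate double yhat = .

local i = 1
while `i' <= _N {
    local a = first[`i']
    local b = last[`i']
    * estimate on this firm's rows only, addressed by row range
    capture qreg y x in `a'/`b'
    if _rc == 0 {
        tempvar p
        quietly predict double `p' in `a'/`b'
        quietly replace yhat = `p' in `a'/`b'
        drop `p'
    }
    local i = `b' + 1                       // jump to the next firm's block
}
```

The -capture- is there so that a firm whose regression fails (for 
example, too few observations) is skipped and yhat stays missing for it, 
rather than stopping the loop.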


Daniel Green
Research & Statistics Group
Federal Reserve Bank of New York
