Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.

# Re: st: More efficient processing of nested loops?

 From Nick Cox To statalist@hsphsun2.harvard.edu Subject Re: st: More efficient processing of nested loops? Date Wed, 15 Aug 2012 17:59:15 +0100

```The good news is that I think you are right. This code appears to be
much more complicated than it needs to be.

just means that it is too much for me to absorb -- but from looking at
your code there are several major and minor inefficiencies.

The main problems are

1. You have two outer loops tangled together, one Stata's and one

2. The inner loop is a loop over one case, and so not needed.

3. -egen- calls are very inefficient to calculate constants that
-summarize- can calculate, except that #1 implies to me that you can
do most of the work in a few -egen- calls.

4. Some copying of values from one place to another to no obvious purpose.

With some guesswork, your code to me boils down to

egen dt_temp_2 = min(date_t_4), by(cum_id_cum)
egen cusip_mode = mode(cusip), by(cum_id_cum)
local mark "date <= d2_temp_2 & date >= dt_temp_2 & cusip_mode==cusip"
egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
replace cum_ab = r(sum) if `mark'

Note that your use of "temporary variables" is not the same as Stata's.

Nick

On Wed, Aug 15, 2012 at 4:26 PM, Robson Glasscock <glasscockrc@vcu.edu> wrote:

> I am running Stata 12. I have written code that creates a variable
> that is the sum of abnormal returns for each firm. The abnormal
> returns are accumulated from three days after the firm makes its
> earnings announcement in quarter t-4 to three days after the firm
> makes its earnings announcement in quarter t. My problem is that it
> takes around 2 minutes for my code to execute for each firm/quarter
> announcement, and there are around 150,000 earnings announcements in
> the dataset.
>
> I modified a panel dataset with the earnings announcement dates for
> each firm so that each observation contains both the quarter t
> earnings announcement date and the quarter t-4 earnings announcement
> date (with the three days added to each per above). I then merged this
> dataset with a second panel dataset that contains the daily returns
> for each firm. The merged dataset has around 10.2 million
> observations.
>
> Next, I created a count variable, cum_id_cum, which is a running total
> of each earnings announcement. This variable is truly cumulative and
> does not reset back to 1 when the next firm releases its first
> earnings announcement. The loop contains a variable, runn, that starts
> with a value equal to "1" and increases by 1 each time the loop is
> processed. I'm using that to help identify the particular dates of the
> quarter t and quarter t-4 earnings announcement so that the abnormal
> returns are accumulated over the correct period. Datadate_2 is the
> quarter t earnings announcement and  date_t_4 is the quarter t-4
> earnings announcement. Cusip is the identifier for each firm. Date is
> the date of the firm's return in the stock market for each trading
> day.
>
> The big picture of my approach was to create temporary variables that
> will equal the cusip, datadate_2, and date_t_4 when runn equals
> cum_id_cum. These first-step temporary variables (d2_temp, dt_temp,
> and cusip_temp) have missing values except in the observation where
> runn equals cum_id_cum so I created second-step temporary variables
> (cusip_temp_2, d2_temp_2, and dt_temp_2) which place the cusip,
> datadate_2, and date_t_4 values for each particular run through the
> entire dataset. This allows me to mark the days for each firm that are
> between the dates of the earnings announcements and then sum up the
> abnormal returns in a temporary variable called cum_ab_temp. The final
> variable with the sum of the abnormal returns for each firm is cum_ab
> and is retained in the observation where _merge==3 (from the merge
> mentioned in the second paragraph of this post).
>
> My code is below. Note that I constrain it to the first 25,000
> cum_id_cum values due to macro size constraints:
>
> gen runn= 1
> levelsof cum_id_cum if cum_id_cum !=. & cum_id_cum <25000, local(cum_id_cum)
> foreach 1 of local cum_id_cum{
> gen d2_temp= datadate_2 if runn== cum_id_cum
> gen dt_temp= date_t_4 if runn== cum_id_cum
> gen cusip_temp= cusip if runn== cum_id_cum
> egen cusip_temp_2= mode(cusip_temp)
> egen d2_temp_2= min(d2_temp)
> egen dt_temp_2= min(dt_temp)
> foreach x of varlist date{
> replace mark= 1 if `x' <= d2_temp_2 & `x' >= dt_temp_2 & cusip_temp_2==cusip
> egen cum_ab_temp= total(ab_ret) if mark==1
> replace cum_ab= cum_ab_temp if datadate_2== d2_temp & dt_temp==
> date_t_4 & cusip_temp==cusip & d2_temp !=. & dt_temp !=.
> replace runn= runn+1
> drop d2_temp
> drop dt_temp
> drop d2_temp_2
> drop dt_temp_2
> drop cusip_temp
> drop cusip_temp_2
> drop cum_ab_temp
> replace mark=0
>
> }
> }
>
> I'm wondering if there is a more efficient way to do the above which
> will result in a significantly faster processing time. My fear is that
> the above ignores the functionality of Stata and instead uses
> inefficient brute force.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```