Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: More efficient processing of nested loops?


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: More efficient processing of nested loops?
Date   Wed, 15 Aug 2012 18:48:58 +0100

More speed-ups:

bysort cum_id_cum (datadate2) : gen d2_temp_2 = datadate_2[1]
bysort cum_id_cum (date_t_4) : gen dt_temp_2 = date_t_4[1]

On Wed, Aug 15, 2012 at 6:07 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
> replace cum_ab = r(sum) if `mark'
>
> should be
>
> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
> replace cum_ab = cum_ab_temp if `mark'
>
> On Wed, Aug 15, 2012 at 5:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> The good news is that I think you are right. This code appears to be
>> much more complicated than it needs to be.
>>
>> I can't follow your word description -- doesn't mean it's unclear,
>> just means that it is too much for me to absorb -- but from looking at
>> your code there are several major and minor inefficiencies.
>>
>> The main problems are
>>
>> 1. You have two outer loops tangled together, one Stata's and one
>> home-made, but neither appears necessary.
>>
>> 2. The inner loop is a loop over one case, and so not needed.
>>
>> 3. -egen- calls are very inefficient to calculate constants that
>> -summarize- can calculate, except that #1 implies to me that you can
>> do most of the work in a few -egen- calls.
>>
>> 4. Some copying of values from one place to another to no obvious purpose.
>>
>> With some guesswork, your code to me boils down to
>>
>> egen d2_temp_2 = min(datadate_2), by(cum_id_cum)
>> egen dt_temp_2 = min(date_t_4), by(cum_id_cum)
>> egen cusip_mode = mode(cusip), by(cum_id_cum)
>> local mark "date <= d2_temp_2 & date >= dt_temp_2 & cusip_mode==cusip"
>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
>> replace cum_ab = r(sum) if `mark'
>>
>> Note that your use of "temporary variables" is not the same as Stata's.
>>
>> Nick
>>
>> On Wed, Aug 15, 2012 at 4:26 PM, Robson Glasscock <glasscockrc@vcu.edu> wrote:
>>
>>> I am running Stata 12. I have written code that creates a variable
>>> that is the sum of abnormal returns for each firm. The abnormal
>>> returns are accumulated from three days after the firm makes its
>>> earnings announcement in quarter t-4 to three days after the firm
>>> makes its earnings announcement in quarter t. My problem is that it
>>> takes around 2 minutes for my code to execute for each firm/quarter
>>> announcement, and there are around 150,000 earnings announcements in
>>> the dataset.
>>>
>>> I modified a panel dataset with the earnings announcement dates for
>>> each firm so that each observation contains both the quarter t
>>> earnings announcement date and the quarter t-4 earnings announcement
>>> date (with the three days added to each per above). I then merged this
>>> dataset with a second panel dataset that contains the daily returns
>>> for each firm. The merged dataset has around 10.2 million
>>> observations.
>>>
>>> Next, I created a count variable, cum_id_cum, which is a running total
>>> of each earnings announcement. This variable is truly cumulative and
>>> does not reset back to 1 when the next firm releases its first
>>> earnings announcement. The loop contains a variable, runn, that starts
>>> with a value equal to "1" and increases by 1 each time the loop is
>>> processed. I'm using that to help identify the particular dates of the
>>> quarter t and quarter t-4 earnings announcement so that the abnormal
>>> returns are accumulated over the correct period. Datadate_2 is the
>>> quarter t earnings announcement and  date_t_4 is the quarter t-4
>>> earnings announcement. Cusip is the identifier for each firm. Date is
>>> the date of the firm's return in the stock market for each trading
>>> day.
>>>
>>> The big picture of my approach was to create temporary variables that
>>> will equal the cusip, datadate_2, and date_t_4 when runn equals
>>> cum_id_cum. These first-step temporary variables (d2_temp, dt_temp,
>>> and cusip_temp) have missing values except in the observation where
>>> runn equals cum_id_cum so I created second-step temporary variables
>>> (cusip_temp_2, d2_temp_2, and dt_temp_2) which place the cusip,
>>> datadate_2, and date_t_4 values for each particular run through the
>>> entire dataset. This allows me to mark the days for each firm that are
>>> between the dates of the earnings announcements and then sum up the
>>> abnormal returns in a temporary variable called cum_ab_temp. The final
>>> variable with the sum of the abnormal returns for each firm is cum_ab
>>> and is retained in the observation where _merge==3 (from the merge
>>> mentioned in the second paragraph of this post).
>>>
>>> My code is below. Note that I constrain it to the first 25,000
>>> cum_id_cum values due to macro size constraints:
>>>
>>> gen runn= 1
>>> levelsof cum_id_cum if cum_id_cum !=. & cum_id_cum <25000, local(cum_id_cum)
>>> foreach 1 of local cum_id_cum{
>>> gen d2_temp= datadate_2 if runn== cum_id_cum
>>> gen dt_temp= date_t_4 if runn== cum_id_cum
>>> gen cusip_temp= cusip if runn== cum_id_cum
>>> egen cusip_temp_2= mode(cusip_temp)
>>> egen d2_temp_2= min(d2_temp)
>>> egen dt_temp_2= min(dt_temp)
>>> foreach x of varlist date{
>>> replace mark= 1 if `x' <= d2_temp_2 & `x' >= dt_temp_2 & cusip_temp_2==cusip
>>> egen cum_ab_temp= total(ab_ret) if mark==1
>>> replace cum_ab= cum_ab_temp if datadate_2== d2_temp & dt_temp==
>>> date_t_4 & cusip_temp==cusip & d2_temp !=. & dt_temp !=.
>>> replace runn= runn+1
>>> drop d2_temp
>>> drop dt_temp
>>> drop d2_temp_2
>>> drop dt_temp_2
>>> drop cusip_temp
>>> drop cusip_temp_2
>>> drop cum_ab_temp
>>> replace mark=0
>>>
>>> }
>>> }
>>>
>>> I'm wondering if there is a more efficient way to do the above which
>>> will result in a significantly faster processing time. My fear is that
>>> the above ignores the functionality of Stata and instead uses
>>> inefficient brute force.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index