Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Robson Glasscock <glasscockrc@vcu.edu> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: More efficient processing of nested loops? |

Date |
Wed, 15 Aug 2012 15:18:56 -0400 |

I completely understand that, and I'm also very appreciative of the time you spent with my problem. best, r On Wed, Aug 15, 2012 at 3:02 PM, Nick Cox <njcoxstata@gmail.com> wrote: > Sorry, Robson, but I spent a while working on my previous post and > will add no more time to this thread. > Someone else may naturally be able to help. > > On Wed, Aug 15, 2012 at 7:55 PM, Robson Glasscock <glasscockrc@vcu.edu> wrote: >> Nick, >> Thank you for your help with this. Your approach gives me the return >> for a single day. There is some variation, but that day is usually the >> return on the date the two datasets were merged. These observations >> have non-missing values for datadate_2 and date_t_4. I had a difficult >> time writing this up, and I apologize if my description of the merged >> datasets or the problem was not clear enough. >> >> Dataset A has around 12 years of daily returns for 3,500 firms. >> Dataset B has the earnings announcement date for quarter t >> (datadate_2) and quarter t-4 (date_t_4) in each row. I merged these >> two datasets (1:1 cusip date) into Dataset C and ran the two loops. >> There are typically around 250 days between datadate_2 and date_t_4 >> that need to be added together. >> >> I can provide any additional information if needed. Here is the code I >> ran from your responses: >> >> bysort cum_id_cum (datadate_2) : gen d2_temp_2 = datadate_2[1] >> bysort cum_id_cum (date_t_4) : gen dt_temp_2 = date_t_4[1] >> egen cusip_mode = mode(cusip), by(cum_id_cum) >> local mark "date <= d2_temp_2 & date >= dt_temp_2 & cusip_mode==cusip" >> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum) >> replace cum_ab = cum_ab_temp if `mark' >> >> I wrote the two inefficient loops based on what I would do by hand to >> sum up the abnormal returns. Go into the first cusip and identify the >> two boundary dates for the first earnings announcement. Go back >> through all the daily returns for that cusip and mark each day that is >> between these two dates. Sum up these individual returns into one >> number. Do this again for the first cusip but second earnings >> announcement, etc. >> >> Thanks again for the time you have already spent on this, >> Robson >> >> >> On Wed, Aug 15, 2012 at 1:48 PM, Nick Cox <njcoxstata@gmail.com> wrote: >>> More speed-ups: >>> >>> bysort cum_id_cum (datadate2) : gen d2_temp_2 = datadate_2[1] >>> bysort cum_id_cum (date_t_4) : gen dt_temp_2 = date_t_4[1] >>> >>> On Wed, Aug 15, 2012 at 6:07 PM, Nick Cox <njcoxstata@gmail.com> wrote: >>>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum) >>>> replace cum_ab = r(sum) if `mark' >>>> >>>> should be >>>> >>>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum) >>>> replace cum_ab = cum_ab_temp if `mark' >>>> >>>> On Wed, Aug 15, 2012 at 5:59 PM, Nick Cox <njcoxstata@gmail.com> wrote: >>>>> The good news is that I think you are right. This code appears to be >>>>> much more complicated than it needs to be. >>>>> >>>>> I can't follow your word description -- doesn't mean it's unclear, >>>>> just means that it is too much for me to absorb -- but from looking at >>>>> your code there are several major and minor inefficiencies. >>>>> >>>>> The main problems are >>>>> >>>>> 1. You have two outer loops tangled together, one Stata's and one >>>>> home-made, but neither appears necessary. >>>>> >>>>> 2. The inner loop is a loop over one case, and so not needed. >>>>> >>>>> 3. -egen- calls are very inefficient to calculate constants that >>>>> -summarize- can calculate, except that #1 implies to me that you can >>>>> do most of the work in a few -egen- calls. >>>>> >>>>> 4. Some copying of values from one place to another to no obvious purpose. >>>>> >>>>> With some guesswork, your code to me boils down to >>>>> >>>>> egen d2_temp_2 = min(datadate_2), by(cum_id_cum) >>>>> egen dt_temp_2 = min(date_t_4), by(cum_id_cum) >>>>> egen cusip_mode = mode(cusip), by(cum_id_cum) >>>>> local mark "date <= d2_temp_2 & date >= dt_temp_2 & cusip_mode==cusip" >>>>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum) >>>>> replace cum_ab = r(sum) if `mark' >>>>> >>>>> Note that your use of "temporary variables" is not the same as Stata's. >>>>> >>>>> Nick >>>>> >>>>> On Wed, Aug 15, 2012 at 4:26 PM, Robson Glasscock <glasscockrc@vcu.edu> wrote: >>>>> >>>>>> I am running Stata 12. I have written code that creates a variable >>>>>> that is the sum of abnormal returns for each firm. The abnormal >>>>>> returns are accumulated from three days after the firm makes its >>>>>> earnings announcement in quarter t-4 to three days after the firm >>>>>> makes its earnings announcement in quarter t. My problem is that it >>>>>> takes around 2 minutes for my code to execute for each firm/quarter >>>>>> announcement, and there are around 150,000 earnings announcements in >>>>>> the dataset. >>>>>> >>>>>> I modified a panel dataset with the earnings announcement dates for >>>>>> each firm so that each observation contains both the quarter t >>>>>> earnings announcement date and the quarter t-4 earnings announcement >>>>>> date (with the three days added to each per above). I then merged this >>>>>> dataset with a second panel dataset that contains the daily returns >>>>>> for each firm. The merged dataset has around 10.2 million >>>>>> observations. >>>>>> >>>>>> Next, I created a count variable, cum_id_cum, which is a running total >>>>>> of each earnings announcement. This variable is truly cumulative and >>>>>> does not reset back to 1 when the next firm releases its first >>>>>> earnings announcement. The loop contains a variable, runn, that starts >>>>>> with a value equal to "1" and increases by 1 each time the loop is >>>>>> processed. I'm using that to help identify the particular dates of the >>>>>> quarter t and quarter t-4 earnings announcement so that the abnormal >>>>>> returns are accumulated over the correct period. Datadate_2 is the >>>>>> quarter t earnings announcement and date_t_4 is the quarter t-4 >>>>>> earnings announcement. Cusip is the identifier for each firm. Date is >>>>>> the date of the firm's return in the stock market for each trading >>>>>> day. >>>>>> >>>>>> The big picture of my approach was to create temporary variables that >>>>>> will equal the cusip, datadate_2, and date_t_4 when runn equals >>>>>> cum_id_cum. These first-step temporary variables (d2_temp, dt_temp, >>>>>> and cusip_temp) have missing values except in the observation where >>>>>> runn equals cum_id_cum so I created second-step temporary variables >>>>>> (cusip_temp_2, d2_temp_2, and dt_temp_2) which place the cusip, >>>>>> datadate_2, and date_t_4 values for each particular run through the >>>>>> entire dataset. This allows me to mark the days for each firm that are >>>>>> between the dates of the earnings announcements and then sum up the >>>>>> abnormal returns in a temporary variable called cum_ab_temp. The final >>>>>> variable with the sum of the abnormal returns for each firm is cum_ab >>>>>> and is retained in the observation where _merge==3 (from the merge >>>>>> mentioned in the second paragraph of this post). >>>>>> >>>>>> My code is below. Note that I constrain it to the first 25,000 >>>>>> cum_id_cum values due to macro size constraints: >>>>>> >>>>>> gen runn= 1 >>>>>> levelsof cum_id_cum if cum_id_cum !=. & cum_id_cum <25000, local(cum_id_cum) >>>>>> foreach 1 of local cum_id_cum{ >>>>>> gen d2_temp= datadate_2 if runn== cum_id_cum >>>>>> gen dt_temp= date_t_4 if runn== cum_id_cum >>>>>> gen cusip_temp= cusip if runn== cum_id_cum >>>>>> egen cusip_temp_2= mode(cusip_temp) >>>>>> egen d2_temp_2= min(d2_temp) >>>>>> egen dt_temp_2= min(dt_temp) >>>>>> foreach x of varlist date{ >>>>>> replace mark= 1 if `x' <= d2_temp_2 & `x' >= dt_temp_2 & cusip_temp_2==cusip >>>>>> egen cum_ab_temp= total(ab_ret) if mark==1 >>>>>> replace cum_ab= cum_ab_temp if datadate_2== d2_temp & dt_temp== >>>>>> date_t_4 & cusip_temp==cusip & d2_temp !=. & dt_temp !=. >>>>>> replace runn= runn+1 >>>>>> drop d2_temp >>>>>> drop dt_temp >>>>>> drop d2_temp_2 >>>>>> drop dt_temp_2 >>>>>> drop cusip_temp >>>>>> drop cusip_temp_2 >>>>>> drop cum_ab_temp >>>>>> replace mark=0 >>>>>> >>>>>> } >>>>>> } >>>>>> >>>>>> I'm wondering if there is a more efficient way to do the above which >>>>>> will result in a significantly faster processing time. My fear is that >>>>>> the above ignores the functionality of Stata and instead uses >>>>>> inefficient brute force. > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: More efficient processing of nested loops?***From:*Robson Glasscock <glasscockrc@vcu.edu>

**Re: st: More efficient processing of nested loops?***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: More efficient processing of nested loops?***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: More efficient processing of nested loops?***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: More efficient processing of nested loops?***From:*Robson Glasscock <glasscockrc@vcu.edu>

**Re: st: More efficient processing of nested loops?***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**RE: st: display graphs on log scale** - Next by Date:
**st: New version of -scheme_rbn1mono- on SSC** - Previous by thread:
**Re: st: More efficient processing of nested loops?** - Next by thread:
**st: create a variable with estimated coefficients on dummies** - Index(es):