Nian Huang <huangnian@gmail.com> is working on an event study that is taking
longer than desired. The code
forvalues i=1(1)1314 {
    quietly reg ret mkt if id==`i' & est_window==1
    quietly predict p if id==`i'
    quietly replace pred_rtn= p if id==`i' & event_window==1
    quietly drop p
}
took about 30 minutes to run.
The above code is fine. It is the easiest way to solve the problem at hand,
but it is not the most efficient.
There are two inefficiencies that Nian can address. First, when Stata sees
an -if- qualifier it must still pass through the excluded observations.
Given the nature of financial event studies, this means that Stata must pass
through many unused observations. The solution for this problem is to use
-in- instead of -if- qualifiers. Second, generating and dropping the variable
p for each id can be very expensive. Again, the nature of event studies
implies that simply replacing the relevant observations in an existing
variable would produce large time savings.
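To see the difference between the two qualifiers in isolation, here is a
minimal sketch on a toy dataset. The dataset, the variable x, and the group
size of 100 are made up for illustration and are not part of Nian's problem.
Both commands use the same 100 observations, but -if- must evaluate its
condition on all 100,000 observations, while -in- goes straight to the stated
range. This only works because the observations for each id are contiguous.

```stata
clear
set obs 100000
// 1,000 groups of 100 consecutive observations; already in id order
gen long id = ceil(_n/100)
gen double x = _n

// -if- evaluates the condition on every observation in the dataset
summarize x if id == 500, meanonly
display r(N)

// -in- visits only the stated range; id 500 occupies observations 49901-50000
summarize x in 49901/50000, meanonly
display r(N)
```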
Below is a quick solution that addresses both problems.
Before presenting it, I generated some simulated data. Note that I am using
only 300 panels in my example. I recommend that Nian try out versions of
this solution with a few hundred panels before moving on to the whole sample.
I also note that these simulated data almost certainly have a simpler
structure than Nian's data.
Here is how I generated the data.
// ********************** Begin generate data ************************
clear
set mem 500m
set rmsg on
local N = 300
local T = 1000
set obs `N'
gen id = _n
expand `T'
sort id
by id: gen t = _n
sort id t
gen double mkt = uniform()
gen double ret = ln(id+1)*mkt + uniform()
by id: gen first = int(800*uniform())
by id: gen last = int(200*uniform()) + first
gen byte event_window = t>=first & t<=last
// ************************ End generate data ***********************
When I ran the simple solution on these data, I obtained:
. gen pred_rtn = .
(300000 missing values generated)
r; t=0.05 8:46:07
. forvalues i=1/`N' {
2. qui regress ret mkt if id==`i' & event_window==1
3. quietly predict p if id==`i'
4. quietly replace pred_rtn=p if id==`i' & event_window==1
5. quietly drop p
6. }
r; t=100.22 8:47:47
The more efficient solution uses -in- instead of -if-, and -matrix score-
with the -replace- option instead of -predict-.
To identify the observations to be included for each id, I sort the data,
generate a variable that holds the observation numbers of the included
observations, and summarize this variable for each id. (There are faster
solutions, but this one is relatively straightforward.)
Instead of using -predict- to create a temporary variable for each id, I use
-matrix score ..., replace- to put the predictions directly into the
variable of interest.
Here is the code that I used.
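// put each id's event-window observations in one contiguous block
sort id event_window t
// observation number, defined only for event-window observations
// (use -gen long- if the data have more than about 16.7 million observations)
gen n_in = _n if event_window==1
gen pred_rtn2 = .
forvalues i=1/`N' {
    // first and last observation of this id's block
    qui sum n_in if id==`i' & event_window==1, meanonly
    local firstob = r(min)
    local lastob = r(max)
    qui regress ret mkt in `firstob'/`lastob'
    mat b = e(b)
    // write the fitted values straight into pred_rtn2
    quietly matrix score pred_rtn2 = b in `firstob'/`lastob' , replace
}
// compare with the predictions from the simple solution
gen double diff = reldif(pred_rtn2, pred_rtn)
sum diff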
Note that I included a check that the two methods produce the same result
for my little sample. I recommend that Nian include such a check when
trying this solution out on some subsamples.
When I ran this code, I obtained:
. sort id event_window t
r; t=0.94 8:47:48
. gen n_in = _n if event_window==1
(269897 missing values generated)
r; t=0.08 8:47:48
. gen pred_rtn2 = .
(300000 missing values generated)
r; t=0.05 8:47:48
. forvalues i=1/`N' {
2. qui sum n_in if id==`i' & event_window==1, meanonly
3. local firstob = r(min)
4. local lastob = r(max)
5. qui regress ret mkt in `firstob'/`lastob'
6. mat b = e(b)
7. quietly matrix score pred_rtn2 = b in `firstob'/`lastob' , replace
8. }
r; t=31.26 8:48:19
This solution runs in less than 1/3 of the time taken by the original
solution. For some reasonable data structures, this fraction can shrink
further as the number of panels grows.
Finally, my check passed.
. gen double diff = reldif(pred_rtn2, pred_rtn)
r; t=0.09 8:48:20
. sum diff
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        diff |    300000           0           0          0          0
r; t=0.10 8:48:20
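As noted earlier, there are faster ways to find each id's block than running
-summarize- with an -if- qualifier inside the loop. The following is one
possible sketch; it is not the code timed above. It stores the size of each
(id, event_window) block once and then walks through the sorted data block by
block, so the loop itself contains no -if- qualifier. The small dataset, the
seed, the fixed event window from t=50 to t=99, and the names blk_n,
pred_rtn3, start, and stop are assumptions made for illustration
(runiform() is the current name of the uniform() function used above). The
sketch also assumes that every event-window block has enough observations
for the regression.

```stata
// small illustrative dataset: 20 ids, 200 periods, event window t=50..99
clear
set seed 12345
set obs 20
gen id = _n
expand 200
bysort id: gen t = _n
gen double mkt = runiform()
gen double ret = ln(id+1)*mkt + runiform()
gen byte event_window = inrange(t, 50, 99)

// make each (id, event_window) block contiguous and record its size once
sort id event_window t
by id event_window: gen long blk_n = _N
gen pred_rtn3 = .

// walk through the data one block at a time
local start = 1
while `start' <= _N {
    local stop = `start' + blk_n[`start'] - 1
    // only event-window blocks are estimated and scored
    if event_window[`start'] == 1 {
        qui regress ret mkt in `start'/`stop'
        matrix b = e(b)
        qui matrix score pred_rtn3 = b in `start'/`stop', replace
    }
    local start = `stop' + 1
}

// spot check against the straightforward -if- approach for one id
qui regress ret mkt if id==3 & event_window==1
qui predict double p if id==3 & event_window==1
assert reldif(pred_rtn3, p) < 1e-6 if id==3 & event_window==1
```

As with the solution above, it is worth checking the results against the
simple -if- version on a subsample before using this on the full data.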
I hope this helps Nian speed up the code.
David
ddrukker@stata.com