Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
George Vega Yon <g.vegayon@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Improving code speed |

Date |
Thu, 23 May 2013 13:50:32 -0400 |

Dear Luis, You can do both, translating part of your program into mata and running it in parallel fashion. I recommend you to start by rewriting the subloop, it could be something like this (is just an example), following the previous example that I sent ______________ MATA ______________ cap mata: mata drop myfun() mata: real colvector myfun(real scalar reps, real colvector y, real colvector z) { real scalar i, tsum real colvector x for(i=1;i<=reps;i++) { x = z :+ rnormal(rows(z),1,0,1) y = y :+ ((x):^2):/ sum(x) } return(y) } end ______________ MATA ______________ ______________ STATA ______________ forvalues k=`=id[1]'(1)`=id[_N]'{ gen subs=(id<=`k') // Define the subsample to be used gen Y`k'=0 // gen the intermediate Y`k' mata: st_store(1::`k', "Y`k'", myfun($reps, st_data(1::`k', "Y`k'"), st_data(1::`k',"z"))) replace Y`k'=Y`k'/$reps // average Y from the 100 simulations replace Y= Y + Y`k' drop Y`k' subs } ______________ STATA ______________ Mata is fast, take this for example, two algorithms that generate the same thing take quite different times to finish ____________________________ timer clear // Stata loop + Substata loop timer on 1 forval j = 1/1000 { forval i = 1/1000 { scalar x = sqrt(exp(1/c(pi))) + sqrt(exp(1/c(pi))) } } timer off 1 // equals 2.3450392 di x // Stata loop + mata loop timer on 2 forval j = 1/1000 { mata: for(i=1;i<=1000;i++) x = sqrt(exp(1/c("pi"))) + sqrt(exp(1/c("pi"))) } timer off 2 // equals 2.3450392 mata:x // In this example, the second version took ~ 1/12 of the time the first did . timer list 1: 8.32 / 1 = 8.3240 2: 0.72 / 1 = 0.7220 ____________________________ You will save a lot of time! Best regards, George Vega Yon 7 647 2552 http://cl.linkedin.com/in/georgevegayon 2013/5/23 Nick Cox <njcoxstata@gmail.com>: > Thanks for confirmation of mistakes. > > So, here is an extra comment as you are concerned about cleaning up > your code and speeding it up. I can see no reason at all for copying > variable -epsilon- to matrix -epsilon- using -mkmat-. You have a > variable -epsilon- that should be fine for your purpose. In fact that > was I suggested yesterday. > > If you post revised, complete and correct code you may get further > comments, but no definite promises from me. > Nick > njcoxstata@gmail.com > > > On 23 May 2013 09:01, Luis Aguiar <stataluis@gmail.com> wrote: >> Hi George and Nick, >> >> Thanks a lot for your responses. >> >> Nick: Yes, the code works but there are indeed a couple of mistakes in >> the one I copied. I actually use the command " mkmat epsilon if >> id<=`reps' " after generating the variable epsilon (id is indeed _n in >> a variable). The second line in your comment (b) should read " replace >> x`i'=z + epsilon[`i',1] if id==`k' " . Sorry about all that. Thanks >> for your helpful comments though, I will try to incorporate them into >> my code along with George's comments in order to speed things up. >> >> George: Thanks for your suggestions as well. I wasn't sure if it would >> be worth going into mata, but I will try it now. Your parallel code >> seems very interesting too. Do you think it would go faster than using >> mata? >> >> Again, thanks a lot to both of you! >> >> Cheers, >> Luis >> >> 2013/5/22 Nick Cox <njcoxstata@gmail.com>: >>> Please use your full real name. See Statalist FAQ for that request and why. >>> >>> Some speed-ups are likely to be possible here, but I first I note >>> several puzzles with this code. >>> >>> You don't say so, but presumably -id- is _n in a variable. >>> >>> -z- is unexplained. >>> >>> More problematically, >>> >>> (a) You -generate- a variable -epsilon- but you refer to a matrix -epsilon-. >>> >>> (b) The two lines below won't work as once -x`i'- exists the second >>> command will fail. >>> >>> gen x`i'=z \\ generate simulated variable >>> gen x`i'=z + epsilon[`i',1] if id==`k' \\ Add the random part >>> >>> Why do you say that it works? Did you copy some buggy version by accident? >>> >>> Better to post self-contained code that works. >>> >>> Speed-ups, apart from using Mata. (See George Vega Yon's post.) >>> >>> 1. Too much copying from one variable to another. I could be wrong, >>> but some variables appear to be mostly zero, and you are just copying >>> constants. Think in terms of scalars instead. >>> >>> 2. Use -summarize, meanonly- to get sums. -egen- is very slow at this. >>> >>> 3. Use -in 1/`k'- or -in `k'- wherever possible. Whenever there is a >>> choice between -if- and -in- for the same problem, -in- is faster. >>> >>> Some example code: >>> >>> gen x`i'=z >>> replace x`i'= x`i' + epsilon[`i'] in `k' >>> >>> su x`i' in 1/`k', meanonly >>> replace Y`k' = (x`i')^2/r(sum) in `k' >>> >>> >>> Nick >>> njcoxstata@gmail.com >>> >>> >>> On 22 May 2013 18:25, Luis <stataluis@gmail.com> wrote: >>>> Dear statalist users, >>>> >>>> I am running into a "loop efficiency problem" in that I have to >>>> construct a variable using many iterations and I am not sure whether I >>>> am being as efficient as possible. Given the number of observations >>>> that I have and with my current code, I have to wait days for my code >>>> to finish running! Here's my problem: >>>> >>>> I have a total of 50000 observations and need to construct a variable >>>> Y that will be computed using different subsamples of these >>>> observations. In particular, >>>> Y=Y1 when the subsample contains only the first observation, >>>> Y=Y2 when the subsample contains observations 1 and 2, >>>> Y=Y3 when the subsample contains observations 1, 2 and 3 etc until >>>> Y=Y50000. >>>> >>>> The idea is therefore to loop over the sample and define the subsample >>>> which contains observations 1 until k and construct the variable >>>> Y`k'=Yk if id==k and Y`k'=0 if id!=k. Then sum the variables Y`k' >>>> after each loop to end up with the final variable Y. >>>> >>>> To further complicate things, the variable Y needs to be the average >>>> of 100 simulations that depend on draws taken from a normal >>>> distribution. Hence I need to do a loop within the initial loop in >>>> order to do the 100 simulations. >>>> >>>> My code therefore looks like this: >>>> >>>> _____________________________________________________________________________________ >>>> >>>> gen Y=0 >>>> >>>> local reps=100 \\ define the number of simulations >>>> >>>> gen epsilon=rnormal() \\ generate the random var for the simulations >>>> >>>> forvalues k=1(1)50000{ >>>> >>>> gen subs=(id<=`k') \\ Define the subsample to be used >>>> gen Y`k'=0 \\ gen the intermediate Y`k' >>>> >>>> forvalues i=1(1)`reps'{ >>>> >>>> gen x`i'=z \\ generate simulated variable >>>> gen x`i'=z + epsilon[`i',1] if id==`k' \\ Add the random part >>>> >>>> gen t=(x`i')^2 >>>> bysort subs: egen tsum=sum(x`i') >>>> >>>> gen Y_`i'=t/tsum if id ==`k' \\ Construct Y for simulation i >>>> replace Y_`i'=0 if id!=`k' >>>> >>>> replace Y`k'=Y`k' + Y_`i' >>>> replace Y`k'=0 if id!=`k' >>>> >>>> drop Y_`i' t tsum x`i' >>>> } >>>> >>>> replace Y`k'=Y`k'/`reps' // average Y from the 100 simulations >>>> replace Y= Y + Y`k' >>>> drop Y`k' subs >>>> } >>>> >>>> ____________________________________________________________________________________ >>>> >>>> >>>> The code runs fine, but I takes a lot of time since it has to >>>> construct 100 variables for each of the 50000 iterations. I have tried >>>> many different possibilities and I can't think of another way of >>>> constructing Y. >>>> >>>> Any tip or suggestion that would help improve the efficiency of my >>>> code would be greatly appreciated!!! >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> * http://www.ats.ucla.edu/stat/stata/ >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/faqs/resources/statalist-faq/ > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

**References**:**st: Improving code speed***From:*Luis <stataluis@gmail.com>

**Re: st: Improving code speed***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: Improving code speed***From:*Luis Aguiar <stataluis@gmail.com>

**Re: st: Improving code speed***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**Re: st: RE: Mlogit for Panel Data** - Next by Date:
**Re: st: Need to Split String Variable** - Previous by thread:
**Re: st: Improving code speed** - Next by thread:
**st: Fwd: changing the format of the current date** - Index(es):