Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: Improving code speed
Date: Wed, 22 May 2013 19:13:39 +0100

Please use your full real name. See the Statalist FAQ for that request
and why.

Some speed-ups are likely to be possible here, but first I note
several puzzles with this code. You don't say so, but presumably -id-
is _n in a variable. -z- is unexplained. More problematically,

(a) You -generate- a variable -epsilon- but you refer to a matrix -epsilon-.

(b) The two lines below won't work, as once -x`i'- exists the second
command will fail.

gen x`i'=z \\ generate simulated variable
gen x`i'=z + epsilon[`i',1] if id==`k' \\ Add the random part

Why do you say that it works? Did you copy some buggy version by
accident? Better to post self-contained code that works.

Speed-ups, apart from using Mata (see George Vega Yon's post):

1. Too much copying from one variable to another. I could be wrong,
but some variables appear to be mostly zero, and you are just copying
constants. Think in terms of scalars instead.

2. Use -summarize, meanonly- to get sums. -egen- is very slow at this.

3. Use -in 1/`k'- or -in `k'- wherever possible. Whenever there is a
choice between -if- and -in- for the same problem, -in- is faster.

Some example code:

gen x`i'=z
replace x`i'= x`i' + epsilon[`i'] in `k'
su x`i' in 1/`k', meanonly
replace Y`k' = (x`i')^2/r(sum) in `k'

Nick
njcoxstata@gmail.com

On 22 May 2013 18:25, Luis <stataluis@gmail.com> wrote:
> Dear statalist users,
>
> I am running into a "loop efficiency problem" in that I have to
> construct a variable using many iterations and I am not sure whether I
> am being as efficient as possible. Given the number of observations
> that I have and with my current code, I have to wait days for my code
> to finish running! Here's my problem:
>
> I have a total of 50000 observations and need to construct a variable
> Y that will be computed using different subsamples of these
> observations.
> In particular,
> Y=Y1 when the subsample contains only the first observation,
> Y=Y2 when the subsample contains observations 1 and 2,
> Y=Y3 when the subsample contains observations 1, 2 and 3 etc until
> Y=Y50000.
>
> The idea is therefore to loop over the sample and define the subsample
> which contains observations 1 until k and construct the variable
> Y`k'=Yk if id==k and Y`k'=0 if id!=k. Then sum the variables Y`k'
> after each loop to end up with the final variable Y.
>
> To further complicate things, the variable Y needs to be the average
> of 100 simulations that depend on draws taken from a normal
> distribution. Hence I need to do a loop within the initial loop in
> order to do the 100 simulations.
>
> My code therefore looks like this:
>
> _____________________________________________________________________________________
>
> gen Y=0
>
> local reps=100 \\ define the number of simulations
>
> gen epsilon=rnormal() \\ generate the random var for the simulations
>
> forvalues k=1(1)50000{
>
>     gen subs=(id<=`k') \\ Define the subsample to be used
>     gen Y`k'=0 \\ gen the intermediate Y`k'
>
>     forvalues i=1(1)`reps'{
>
>         gen x`i'=z \\ generate simulated variable
>         gen x`i'=z + epsilon[`i',1] if id==`k' \\ Add the random part
>
>         gen t=(x`i')^2
>         bysort subs: egen tsum=sum(x`i')
>
>         gen Y_`i'=t/tsum if id ==`k' \\ Construct Y for simulation i
>         replace Y_`i'=0 if id!=`k'
>
>         replace Y`k'=Y`k' + Y_`i'
>         replace Y`k'=0 if id!=`k'
>
>         drop Y_`i' t tsum x`i'
>     }
>
>     replace Y`k'=Y`k'/`reps' // average Y from the 100 simulations
>     replace Y= Y + Y`k'
>     drop Y`k' subs
> }
>
> ____________________________________________________________________________________
>
>
> The code runs fine, but it takes a lot of time since it has to
> construct 100 variables for each of the 50000 iterations. I have tried
> many different possibilities and I can't think of another way of
> constructing Y.
>
> Any tip or suggestion that would help improve the efficiency of my
> code would be greatly appreciated!!!
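
The three suggestions in the reply (scalars in place of copied variables, -summarize, meanonly- for sums, -in- rather than -if-) might be combined roughly as follows. This is only a sketch under assumptions that are not in the thread: -z- is replaced by made-up toy data, the sample is cut to 200 observations, each simulation is assumed to use a single normal draw that is reused for every k, and the scalar names (eps1 to eps100, zsum, zk, acc) are hypothetical. It is one reading of what the posted code intends, not a confirmed translation of it.

```stata
* Sketch only. Toy data stand in for the poster's z, which the thread never defines.
clear
set seed 2013
set obs 200                          // small N so the sketch runs in moments
generate double z = 1 + runiform()   // hypothetical z

local reps = 100
forvalues i = 1/`reps' {
    scalar eps`i' = rnormal()        // assumption: one draw per simulation, reused for every k
}

generate double Y = .
local N = _N
forvalues k = 1/`N' {
    summarize z in 1/`k', meanonly   // sum of z over observations 1 to k
    scalar zsum = r(sum)
    scalar zk = z[`k']               // value of z in observation k
    scalar acc = 0
    forvalues i = 1/`reps' {
        * only observation k receives the draw, so the subsample sum shifts by that draw
        scalar acc = acc + (zk + eps`i')^2 / (zsum + eps`i')
    }
    quietly replace Y = acc/`reps' in `k'   // average over the simulations
}
```

No variable is created or dropped inside either loop, which is where the posted code spends most of its time. If the draws really do not depend on k, the repeated -summarize- could probably be replaced by a running sum computed once with sum(z), but that depends on details the original post leaves open.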