
# Re: st: Improving code speed

From: Nick Cox
To: "statalist@hsphsun2.harvard.edu"
Subject: Re: st: Improving code speed
Date: Wed, 22 May 2013 19:13:39 +0100

```
Please use your full real name. See the Statalist FAQ for that request and why.

Some speed-ups are likely to be possible here, but first I note
several puzzles with this code.

You don't say so, but presumably -id- is _n in a variable.

-z- is unexplained.

More problematically,

(a) You -generate- a variable -epsilon- but you refer to a matrix -epsilon-.

(b) The two lines below won't work: once -x`i'- exists, the second
-generate- will fail because the variable is already defined.

gen x`i'=z \\ generate simulated variable
gen x`i'=z + epsilon[`i',1] if id==`k' \\ Add the random part

Why do you say that it works? Did you copy some buggy version by accident?

Better to post self-contained code that works.

Speed-ups, apart from using Mata (see George Vega Yon's post):

1. Too much copying from one variable to another. I could be wrong,
but some variables appear to be mostly zero, and you are just copying
constants. Think in terms of scalars instead.

2. Use -summarize, meanonly- to get sums. -egen- is very slow at this.

3. Use -in 1/`k'- or -in `k'- wherever possible. Whenever there is a
choice between -if- and -in- for the same problem, -in- is faster.

Some example code:

gen x`i'=z
replace x`i'= x`i' + epsilon[`i'] in `k'

su x`i' in 1/`k', meanonly
replace Y`k' = (x`i')^2/r(sum) in `k'

Nick
njcoxstata@gmail.com

On 22 May 2013 18:25, Luis <stataluis@gmail.com> wrote:
> Dear Statalist users,
>
> I am running into a "loop efficiency problem" in that I have to
> construct a variable using many iterations and I am not sure whether I
> am being as efficient as possible. Given the number of observations
> that I have and with my current code, I have to wait days for my code
> to finish running! Here's my problem:
>
> I have a total of 50000 observations and need to construct a variable
> Y that will be computed using different subsamples of these
> observations. In particular,
> Y=Y1 when the subsample contains only the first observation,
> Y=Y2 when the subsample contains observations 1 and 2,
> Y=Y3 when the subsample contains observations 1, 2 and 3 etc until
> Y=Y50000.
>
> The idea is therefore to loop over the sample and define the subsample
> which contains observations 1 until k and construct the variable
> Y`k'=Yk if id==k and Y`k'=0 if id!=k. Then sum the variables Y`k'
> after each loop to end up with the final variable Y.
>
> To further complicate things, the variable Y needs to be the average
> of 100 simulations that depend on draws taken from a normal
> distribution. Hence I need to do a loop within the initial loop in
> order to do the 100 simulations.
>
> My code therefore looks like this:
>
> _____________________________________________________________________________________
>
> gen Y=0
>
> local reps=100 \\ define the number of simulations
>
> gen epsilon=rnormal() \\ generate the random var for the simulations
>
> forvalues k=1(1)50000{
>
> gen subs=(id<=`k')   \\ Define the subsample to be used
> gen Y`k'=0      \\ gen the intermediate Y`k'
>
>         forvalues i=1(1)`reps'{
>
>                 gen x`i'=z \\ generate simulated variable
>                 gen x`i'=z + epsilon[`i',1] if id==`k' \\ Add the random part
>
>         gen t=(x`i')^2
>         bysort subs: egen tsum=sum(x`i')
>
>         gen Y_`i'=t/tsum if id ==`k' \\ Construct Y for simulation i
>         replace Y_`i'=0 if id!=`k'
>
>         replace Y`k'=Y`k' + Y_`i'
>                 replace Y`k'=0 if id!=`k'
>
>         drop Y_`i' t tsum x`i'
>         }
>
> replace Y`k'=Y`k'/`reps'      // average Y from the 100 simulations
> replace Y= Y + Y`k'
> drop Y`k' subs
>         }
>
> ____________________________________________________________________________________
>
>
> The code runs fine, but it takes a lot of time since it has to
> construct 100 variables for each of the 50000 iterations. I have tried
> many different possibilities and I can't think of another way of
> constructing Y.
>
> Any tip or suggestion that would help improve the efficiency of my
> code would be greatly appreciated!!!
```
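The three suggestions in the reply (scalars instead of copied variables, -summarize, meanonly- instead of -egen-, and -in- instead of -if-) can be combined into one loop. What follows is only a sketch under stated assumptions, not the original poster's method. The thread never defines -z-, so a placeholder variable is generated here. -epsilon- is assumed to be a `reps' x 1 matrix holding one normal draw per simulation. The sample size of 200 and the scalar names (zsum, zk, eps, acc) are also made up for illustration. The sketch relies on the observation that within simulation i only observation k differs from -z-, so the sum over observations 1 to k is the sum of -z- plus that one draw, and no per-simulation variable is needed.

```stata
clear
set seed 12345
set obs 200                        // small N for illustration; the post has 50000
gen long id = _n
gen double z = runiform() + 1      // placeholder: z is not defined in the thread

local reps 100
matrix epsilon = J(`reps', 1, .)   // assumption: one draw per simulation
forvalues i = 1/`reps' {
    matrix epsilon[`i', 1] = rnormal()
}

gen double Y = 0

forvalues k = 1/`=_N' {
    // sum of z over observations 1..k, once per k, with no new variable
    summarize z in 1/`k', meanonly
    scalar zsum = r(sum)
    scalar zk   = z[`k']
    scalar acc  = 0

    forvalues i = 1/`reps' {
        scalar eps = epsilon[`i', 1]
        // x equals z everywhere except observation k, where it is z + eps,
        // so sum(x) over 1..k is zsum + eps
        scalar acc = scalar(acc) + (scalar(zk) + scalar(eps))^2 / (scalar(zsum) + scalar(eps))
    }

    // average over simulations, written to observation k only
    quietly replace Y = scalar(acc) / `reps' in `k'
}
```

Each pass through the inner loop now touches a handful of scalars rather than generating and dropping several variables of length N. The -summarize- call could itself be replaced by a running sum of -z-, but the version above stays closer to the suggestions in the reply.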