
st: RE: How to perform very simple manipulations in large data sets more efficiently

From: "Tiago V. Pereira"
To: statalist@hsphsun2.harvard.edu
Subject: st: RE: How to perform very simple manipulations in large data sets more efficiently
Date: Tue, 16 Aug 2011 10:29:44 -0300 (BRT)

```
Billy,

Extremely helpful tips! Thanks a lot!

Cheers!

Tiago

--
I have encountered this problem a lot, and responses from Statalist
(including Nick) helped me a few weeks ago.

Using the summarize command should be quite fast because it's a
built-in (machine-implemented) command, but when I have needed to
find the value of X corresponding to the smallest value of Y (as in
your example), I have typically had to do it over a grouping
variable using -by-.

This brings me to my first suggestion: If there is a way to reduce
your 10,000 repetitions to one pass by marking each part of the
dataset you need to repeat over with a grouping variable (check out
-egen- and its group() function), using -by- can solve most of your
problems in one fell swoop.

Next, the real algorithmic question here is how to identify a minimum
value. That's what Statalisters helped with a few weeks ago. egen's
min() function takes your "simple approach 2": sort and take the
value of Y[1]. This is SLOW. Better to find the minimum manually.
The example below uses -by- but you can do precisely the same thing by
dropping the -by- syntax.

/* example 1 */
clonevar minY = Y
/* by groupid: replace this value of minY with the previous one if the
previous one is smaller, or if the previous one is nonmissing and this
one is missing. The outer parentheses make _n > 1 apply to both
conditions, so the first observation in each group is never replaced
(minY[0] is always . and minY can be an extended missing value > .) */
by groupid, sort: replace minY = minY[_n-1] ///
    if (minY[_n-1] < minY | (minY[_n-1] < . & minY >= .)) & _n > 1
by groupid: keep if Y == minY[_N]

Without a by-grouping you can also add the local command you had
before, as follows:

/*example 2*/
clonevar minY = Y
replace minY = minY[_n-1] ///
    if (minY[_n-1] < minY | (minY[_n-1] < . & minY >= .)) & _n > 1
keep if Y == minY[_N]
local minY = minY[_N]

Finally, if you're sure you have no missing values in Y, you can
simplify the -replace- syntax as follows

/* example 3 fragment simplified */
replace minY = minY[_n-1] if minY[_n-1] < minY

--
I thank Stas and Nick for their helpful comments on my last query.

All the best

Tiago

--
Dear statalisters,

I have to perform extremely simple tasks, but I am struggling with the low
efficiency of my dummy implementations. Perhaps you might have smarter
ideas.

Here is an example:

Suppose I have two variables, X and Y.

I need to get the value of Y that is associated with the smallest value of X.

What I usually do is:

(1) simple approach 1

* ------ start --------
sum X, meanonly
keep if X==r(min)
local my_value = Y[1]
* ------ end --------

(2) simple approach 2

* ------ start --------
sort X
local my_value = Y[1]
* ------ end --------

These approaches are simple, and work very well for small data sets. Now,
I have to repeat that procedure 10k times, for data sets that range from
500k to 1000k observations. Hence, both procedures 1 and 2 become clearly
slow.

If you have any tips, I will be very grateful.

All the best,

Tiago

```
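The grouping suggestion and the running-minimum trick from the thread can be tried together on a small made-up dataset. The sketch below is only an illustration under stated assumptions: the variable `rep` (one value per repetition), the helper variable `minX`, and the toy numbers are all hypothetical and do not come from the original posts. It follows the original question (the value of Y at the smallest X), whereas the reply's examples minimise Y.

```stata
* Toy data (hypothetical): 3 repetitions of 4 observations each,
* with one missing X to exercise the missing-value condition.
clear
input rep X Y
1 5 10
1 2 20
1 9 30
1 7 40
2 4 11
2 8 21
2 1 31
2 6 41
3 3 12
3 . 22
3 0 32
3 5 42
end

* One group id per repetition. If several variables identify a
* repetition, list them all inside group().
egen groupid = group(rep)

* Running minimum of X within each group. -replace- works through the
* observations in order, so minX[_n-1] already holds the minimum so far.
clonevar minX = X
by groupid, sort: replace minX = minX[_n-1] ///
    if (minX[_n-1] < minX | (minX[_n-1] < . & minX >= .)) & _n > 1

* The last observation of each group now carries the group minimum,
* so keep only the rows whose X equals it.
by groupid: keep if X == minX[_N]

list rep X Y, noobs
```

With this toy data one row survives per repetition, holding the smallest X and its Y (20, 31, and 32). Tied minima within a group would all be kept, so a further rule is needed if exactly one row per group is required. The `///` continuations assume the code is run from a do-file rather than typed interactively.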