From
Tiago V. Pereira

To
statalist@hsphsun2.harvard.edu

Subject
st: RE: How to perform very simple manipulations in large data sets more efficiently

Date
Tue, 16 Aug 2011

Billy,

Extremely helpful tips! Thanks a lot!

Cheers!

Tiago

--

I have encountered this problem a lot, and responses from Statalist (including Nick) helped me a few weeks ago.

Using the summarize command should be quite fast because it's a built-in (machine-implemented) command, but when I have needed to find the value of X corresponding to the smallest value of Y (as in your example), I typically have had to do it over a grouping variable using -by-.

This brings me to my first suggestion: if there is a way to reduce your 10,000 repetitions to one pass by marking out each part of the dataset you need to repeat over with a grouping variable (check out -egen- and its group function), using -by- can solve most of your problems in one fell swoop.

Next, the real algorithmic question here is how to identify a minimum value. That's what Statalisters helped with a few weeks ago. egen's min function takes your "simple approach 2": sort and take the value of Y[1]. This is SLOW. Better to find the minimum manually. The example below uses -by-, but you can do precisely the same thing by dropping the -by- syntax.

    /* example 1 */
    clonevar minY = Y
    /* by groupid: replace this value of minY with the previous one
       if the previous one is less, or the previous one is non missing
       and this one is missing. no replacements if _n == 1, because
       minY[0] == . always and minY > . sometimes */
    by groupid, sort: replace minY = minY[_n-1] ///
        if (minY[_n-1] < minY | (minY[_n-1] < . & minY >= .)) & _n > 1
    by groupid: keep if Y == minY[_N]

Without a by-grouping you can also add the local command you had before, as follows:

    /* example 2 */
    clonevar minY = Y
    replace minY = minY[_n-1] ///
        if (minY[_n-1] < minY | (minY[_n-1] < . & minY >= .)) & _n > 1
    keep if Y == minY[_N]
    local minY = minY[_N]

Finally, if you're sure you have no missing values in Y, you can simplify the -replace- syntax as follows:

    /* example 3 fragment simplified */
    replace minY = minY[_n-1] if minY[_n-1] < minY

--

I thank Stas and Nick for their helpful comments on my last query.

All the best,

Tiago

--

Dear statalisters,

I have to perform extremely simple tasks, but I am struggling with the low efficiency of my dummy implementations. Perhaps you might have smarter ideas.

Here is an example: suppose I have two variables, X and Y. I need to get the value of Y that is associated with the smallest value of X.

What I usually do is:

(1) simple approach 1

    /* ------ start -------- */
    sum X, meanonly
    keep if X==r(min)
    local my_value = Y[1]
    /* ------ end -------- */

(2) simple approach 2

    /* ------ start -------- */
    sort X
    local my_value = Y[1]
    /* ------ end -------- */

These approaches are simple, and work very well for small data sets. Now, I have to repeat that procedure 10k times, for data sets that range from 500k to 1000k observations. Hence, both procedures 1 and 2 become clearly slow.

If you have any tips, I will be very grateful.

All the best,

Tiago
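
For the original question (the value of Y at the smallest X), a variant of simple approach 1 that neither sorts nor drops observations might look like the sketch below. It is not taken from the thread: the toy data and the helper variable obsno are made up for illustration, and it assumes X is numeric with at least one non-missing value. If several rows tie for the smallest X, it takes the first of them.

```stata
* Toy data, made up for illustration only
clear
input X Y
3 10
1 20
2 30
end

* One pass to find the smallest X (no sort, nothing dropped)
summarize X, meanonly

* Record the observation number wherever X attains that minimum
* (obsno is a hypothetical helper variable, missing elsewhere)
generate long obsno = _n if X == r(min)

* The smallest recorded number is the first row at the minimum
summarize obsno, meanonly
local first = r(min)

* Read Y from that row; for the toy data this is 20
local my_value = Y[`first']
display "Y at smallest X: `my_value'"

drop obsno
```

Because the data stay in memory in their original order, a loop over 10k repetitions would not need to reload or re-sort anything between passes. Whether this is actually faster than the running-minimum -replace- would need to be timed on the real data.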

