

From: Billy Schwartz <wkschwartz@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: How to perfom very simple manipulations in large data sets more efficiently
Date: Mon, 15 Aug 2011 15:45:57 -0400

I have encountered this problem a lot, and responses from Statalist (including Nick's) helped me a few weeks ago. Using the -summarize- command should be quite fast because it is a built-in (machine-implemented) command. But when I have needed to find the value of X corresponding to the smallest value of Y (your example, with the roles of X and Y reversed), I have typically had to do it over a grouping variable using -by-.

This brings me to my first suggestion: if there is a way to reduce your 10,000 repetitions to one pass by marking each part of the dataset you need to repeat over with a grouping variable (check out -egen- and its group() function), using -by- can solve most of your problems in one fell swoop.

Next, the real algorithmic question here is how to identify a minimum value. That's what Statalisters helped with a few weeks ago. egen's min() function uses your "simple approach 2": sort and take the value of Y[1]. This is SLOW. Better to find the minimum manually. The example below uses -by-, but you can do precisely the same thing by dropping the -by- syntax.

    /* example 1 */
    clonevar minY = Y
    /* by groupid: replace this value of minY with the previous one if the
       previous one is less, or if the previous one is nonmissing and this
       one is missing. No replacements if _n == 1 because minY[0] == .
       always and minY > . sometimes */
    by groupid, sort: replace minY = minY[_n-1] if minY[_n-1] < minY | (minY[_n-1] < . & minY >= .) & _n > 1
    by groupid: keep if Y == minY[_N]

Without a by-grouping you can also add the -local- command you had before, as follows:

    /* example 2 */
    clonevar minY = Y
    replace minY = minY[_n-1] if minY[_n-1] < minY | (minY[_n-1] < . & minY >= .) & _n > 1
    keep if Y == minY[_N]
    local minY = minY[_N]

Finally, if you're sure you have no missing values in Y, you can simplify the -replace- syntax as follows:

    /* example 3 fragment simplified */
    replace minY = minY[_n-1] if minY[_n-1] < minY

On Mon, Aug 15, 2011 at 12:57 PM, Tiago V. Pereira <tiago.pereira@mbe.bio.br> wrote:
>
> I thank Stas and Nick for their helpful comments on my last query.
>
> All the best
>
> Tiago
>
> --
> Dear statalisters,
>
> I have to perform extremely simple tasks, but I am struggling with the low
> efficiency of my dummy implementations. Perhaps you might have smarter
> ideas.
>
> Here is an example:
>
> Suppose I have two variables, X and Y.
>
> I need to get the value of Y that is associated with the smallest value of X.
>
> What I usually do is:
>
> (1) simple approach 1
>
> */ ------ start --------
> sum X, meanonly
> keep if X==r(min)
> local my_value = Y[1]
> */ ------ end --------
>
> (2) simple approach 2
>
> */ ------ start --------
> sort X
> local my_value = Y[1]
> */ ------ end --------
>
> These approaches are simple, and work very well for small data sets. Now,
> I have to repeat that procedure 10k times, for data sets that range from
> 500k to 1000k observations. Hence, both procedures 1 and 2 become clearly
> slow.
>
> If you have any tips, I will be very grateful.
>
> All the best,
>
> Tiago
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
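For readers who want to try the one-pass idea on the question as originally posed (the value of Y at the smallest X), here is a minimal sketch on toy data. The dataset, the group layout, and the names groupid, minX, and checkmin are assumptions made up for illustration; only the running-minimum -replace- is taken from the reply above. The -egen- line is there purely as a cross-check on a small dataset, not as part of the fast path.

```stata
* Hypothetical toy data: 3 groups of 5 observations, standing in for the
* repetitions that would otherwise be processed one at a time.
clear
set seed 12345
set obs 15
generate groupid = ceil(_n / 5)
generate X = runiform()
generate Y = rnormal()

* Cross-check only: group minimum of X computed the conventional way.
egen double checkmin = min(X), by(groupid)

* Running minimum of X within each group. -replace- works through the
* observations in order, so minX[_n-1] already holds the minimum so far.
* X has no missing values here, so the simplified condition is used.
clonevar minX = X
by groupid, sort: replace minX = minX[_n-1] if minX[_n-1] < minX

* Keep the observation(s) in each group where X attains its minimum.
by groupid: keep if X == minX[_N]

* Stops with an error if the running minimum disagrees with -egen-.
assert X == checkmin

* Y in each remaining row is the value associated with the smallest X.
list groupid X Y
```

If X can contain missing values, the longer -replace- condition from example 1 would be the one to adapt. Ties in X would leave more than one row in a group.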

**References**:

- **st: How to perfom very simple manipulations in large data sets more efficiently**, *From:* "Tiago V. Pereira" <tiago.pereira@mbe.bio.br>
