Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: Nick Cox <n.j.cox@durham.ac.uk>

To: "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>

Subject: st: RE: How to perfom very simple manipulations in large data sets more efficiently

Date: Fri, 12 Aug 2011 15:56:44 +0100

This is clear but the problem nevertheless retains some ambiguity.

First, these approaches aren't equivalent. (1) loses all the data except for those observation(s) that are equal to the minimum. (2) keeps all the data.

Second, as the sentence above hints, in general there could be several observations that tie for minimum on X. Perhaps you are confident that this won't bite for your application.

A variant on (1) is

    sum X, meanonly
    su Y if X==r(min)

which will tell you about duplicates.

It doesn't answer your question, but I'll record nevertheless that some of this territory was reviewed in

    SJ-11-2  dm0055  Speaking Stata: Compared with ...
             N. J. Cox
             Q2/11  SJ 11(2):305--314  (no commands)
             reviews techniques for relating values to values in
             other observations

As you are doing this thousands of times, I think you need to get some timings for a few sample datasets.

Nick
n.j.cox@durham.ac.uk

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Tiago V. Pereira
Sent: 12 August 2011 15:43
To: statalist@hsphsun2.harvard.edu
Subject: st: How to perfom very simple manipulations in large data sets more efficiently

Dear statalisters,

I have to perform extremely simple tasks, but I am struggling with the low efficiency of my dummy implementations. Perhaps you might have smarter ideas.

Here is an example:

Suppose I have two variables, X and Y. I need to get the value of Y that is associated with the smallest value of X.

What I usually do is:

(1) simple approach 1

    */ ------ start --------
    sum X, meanonly
    keep if X==r(min)
    local my_value = Y[1]
    */ ------ end --------

(2) simple approach 2

    */ ------ start --------
    sort X
    local my_value = Y[1]
    */ ------ end --------

These approaches are simple, and work very well for small data sets. Now, I have to repeat that procedure 10k times, for data sets that range from 500k to 1000k observations. Hence, both procedures 1 and 2 become clearly slow.

If you have any tips, I will be very grateful.

All the best,

Tiago

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
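The advice to get some timings could be followed with a sketch like the one below. It is only an illustration under stated assumptions: the toy data, the seed, the sample size, and the local macro names (by_sort, by_summarize, n_ties) are all hypothetical and not from the thread. It times approach (2) against the two-pass summarize variant, which neither sorts nor drops observations, and records how many observations tie for the minimum of X.

```stata
* Hypothetical toy data; X and Y stand in for the poster's variables
clear
set seed 2011
set obs 1000000
generate double X = runiform()
generate double Y = rnormal()

timer clear

* Approach (2) from the post: sort, then read the first observation.
* preserve/restore sit outside the timed region and only undo the sort.
preserve
timer on 1
sort X
local by_sort = Y[1]
timer off 1
restore

* Variant on (1): two passes with summarize, no sort, no data dropped
timer on 2
summarize X, meanonly
summarize Y if X == r(min), meanonly
local n_ties = r(N)          // observations tying for the minimum of X
local by_summarize = r(min)  // smallest Y among those observations
timer off 2

display "sort: `by_sort'  summarize: `by_summarize'  ties: `n_ties'"
timer list
```

If n_ties exceeds 1, the two results need not agree: the sort picks whichever tied observation happens to land first, while the summarize variant here returns the smallest Y among the ties. Which route is quicker on real data would have to be read off timer list for datasets of realistic size.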

**References**:

- **st: How to perfom very simple manipulations in large data sets more efficiently**
  - *From:* "Tiago V. Pereira" <tiago.pereira@mbe.bio.br>
