
st: RE: How to perform very simple manipulations in large data sets more efficiently


From   Nick Cox <n.j.cox@durham.ac.uk>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   st: RE: How to perform very simple manipulations in large data sets more efficiently
Date   Fri, 12 Aug 2011 15:56:44 +0100

This is clear but the problem nevertheless retains some ambiguity. 

First, these approaches aren't equivalent. (1) loses all the data except for those observation(s) that are equal to the minimum. (2) keeps all the data. 

Second, as the sentence above hints, in general there could be several observations that tie for minimum on X. Perhaps you are confident that this won't bite for your application. 

A variant on (1) is 

sum X, meanonly
su Y if X==r(min)

which will tell you about duplicates. 
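The logic of that variant (one pass to find the minimum, then list every Y tied at it) is language-neutral; here is a minimal sketch in Python with made-up data, where -sum X, meanonly- corresponds to taking min(X) and -su Y if X==r(min)- corresponds to filtering on that minimum:

```python
# Sketch of the tie-aware variant: one pass to find min(X),
# then report every Y whose X equals that minimum.
X = [3.0, 1.5, 4.2, 1.5, 2.0]
Y = [10, 20, 30, 40, 50]

x_min = min(X)  # plays the role of: sum X, meanonly  ->  r(min)
tied_ys = [y for x, y in zip(X, Y) if x == x_min]  # su Y if X==r(min)
print(tied_ys)  # prints [20, 40]: two observations tie at X == 1.5
```

With ties present, "the value of Y at the smallest X" is not unique, which is exactly what the -su Y if X==r(min)- step would reveal.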

It doesn't answer your question, but I'll record nevertheless that some of this territory was reviewed in 

SJ-11-2 dm0055  . . . . . . . . . . . . . .  Speaking Stata: Compared with ...
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q2/11   SJ 11(2):305--314                                (no commands)
        reviews techniques for relating values to values in other
        observations

As you are doing this thousands of times, I think you need to get some timings for a few sample datasets. 

Nick 
n.j.cox@durham.ac.uk 


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Tiago V. Pereira
Sent: 12 August 2011 15:43
To: statalist@hsphsun2.harvard.edu
Subject: st: How to perform very simple manipulations in large data sets more efficiently

Dear statalisters,

I have to perform extremely simple tasks, but I am struggling with the low
efficiency of my naive implementations. Perhaps you have smarter ideas.

Here is an example:

Suppose I have two variables, X and Y.

I need to get the value of Y that is associated with the smallest value of X.

What I usually do is:

(1) simple approach 1

*/ ------ start --------
sum X, meanonly
keep if X==r(min)
local my_value = Y[1]
*/ ------ end --------

(2) simple approach 2

*/ ------ start --------
sort X
local my_value = Y[1]
*/ ------ end --------

These approaches are simple and work very well for small data sets. Now,
however, I have to repeat that procedure 10,000 times, on data sets that
range from 500k to 1000k observations, and both procedures become
noticeably slow.
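The cost difference between the two approaches can be sketched outside Stata as well (a Python illustration on synthetic data, not a Stata solution): approach (2) pays for a full sort, O(n log n), on every repetition, while a single argmin pass, the analogue of -sum, meanonly- in approach (1), is O(n).

```python
import random

random.seed(1)
n = 100_000
X = [random.random() for _ in range(n)]
Y = list(range(n))

def y_at_min_via_sort():
    # analogue of approach (2): order observations by X, take the first Y
    order = sorted(range(n), key=X.__getitem__)  # O(n log n) on every call
    return Y[order[0]]

def y_at_min_via_scan():
    # analogue of approach (1), without the destructive -keep-:
    # a single O(n) pass locating the first minimum of X
    i_min = min(range(n), key=X.__getitem__)
    return Y[i_min]

# Both locate the same observation (the first one tied at the minimum).
assert y_at_min_via_sort() == y_at_min_via_scan()
```

Repeated thousands of times, the log n factor of the sort adds up, which is why timing both variants on sample data sets of the sizes actually involved is worthwhile.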

If you have any tips, I will be very grateful.

All the best,

Tiago




*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

