Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: 5 mil obs - travel time btw 2 places |
Date | Mon, 2 Dec 2013 21:51:56 +0000 |
Quite so. You selected observations if <whatever> and then took a mean over those observations. The result is a single number, necessarily, regardless of <whatever>, so long as at least one observation is selected.

egen avgtime = mean(crselapsedtime), by(origin dest)

would give separate means for origin-destination pairs.

Nick
njcoxstata@gmail.com

On 2 December 2013 19:43, Coleman, Greg <greg.coleman@emc.com> wrote:
> Thanks Nick - before I saw your note, I did try this;
>
> sort origin dest
>
> . egen avgtime=mean(crselapsedtime) if origin==origin[_n-1] & dest==dest[_n-1]
> (3535 missing values generated)
>
> BUT, the new var avgtime was the same for every single observation.
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
> Sent: Monday, December 02, 2013 1:33 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: 5 mil obs - travel time btw 2 places
>
> When you say "unique", you mean "distinct". On average, these "unique"
> pairs occur about 25,000 times each, not once.
>
> You end with the idea of a 'collapse'. Exactly!
>
> I'd start with looking at -contract- and -collapse- commands.
>
> As a footnote, look also at -groups- (SSC).
>
> Nick
> njcoxstata@gmail.com
>
> On 2 December 2013 18:24, Coleman, Greg <greg.coleman@emc.com> wrote:
>> Hi Stata gurus -
>>
>> A pretty large data set (for me!) where there are just over 5m obs. It's flight data, where there are 29 variables.
>> 2 of the variables are origin, dest. I am struggling with coming up
>> with various statistics when these 2 are the same, meaning all the rows where origin=JFK and dest=SFO (for example). For instance, count the number of times they occurred (how many flights from JFK to SFO overall), the travel time for each of the trips that occurred, which day of the week is typically prone to delays going to SFO from JFK, etc etc.
>>
>> Can someone give me a hint on how to approach this? I tried foreach loops, while loops, using "by()", but I feel like I am not on track to an efficient method.
>> There are over 200 unique origin and dest throughout the 5m obs, so any way I can 'collapse' this data so I can make some graphs would also be great.
>>
>> Thanks!
>> Greg
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
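
For readers of the archive, the suggestions in this thread can be sketched roughly as follows. This is only an illustration, not code posted by anyone in the thread: the variable names origin, dest and crselapsedtime come from the posts, while the three toy observations and the names flights and meantime are made up here so that the example runs on its own.

```stata
* Toy data standing in for the 5m-observation flight file (values are invented)
clear
input str3 origin str3 dest crselapsedtime
"JFK" "SFO" 390
"JFK" "SFO" 400
"BOS" "ORD" 170
end

* (1) Keep every observation and attach the pair-level mean to each row
egen avgtime = mean(crselapsedtime), by(origin dest)
list

* (2) Reduce to one observation per origin-dest pair:
*     flights counts nonmissing crselapsedtime, meantime is its mean
preserve
collapse (count) flights = crselapsedtime (mean) meantime = crselapsedtime, by(origin dest)
list
restore

* (3) Frequencies only, busiest pairs first
preserve
contract origin dest, freq(flights)
gsort -flights
list
restore
```

The by() option is what makes egen compute one mean per origin-dest pair. An if qualifier only restricts which observations enter a single overall mean, which is why the earlier attempt returned the same value on every row. The preserve/restore pairs are there only so each step starts from the full data; with the real file one would probably collapse once and graph from the reduced dataset.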