Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Neil Shephard <nshephard@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: looping over files -- speed and Stata/MP |
Date | Wed, 16 Mar 2011 14:56:30 +0000 |
On Wed, Mar 16, 2011 at 2:48 PM, Dimitri Szerman <dimitrijoe@gmail.com> wrote:
> Hello,
>
> In constructing a data set, I have to loop over hundreds of thousands
> of files. Simply put, this is what I do:
>
> ! dir "mydir" /a-d /b > filelist.txt // list of files to be imported
> file open LIST using "filelist.txt", read
> file read LIST line
> while r(eof)==0 {
>
> (a bunch of Stata commands)
>
> save mydir2\\`line', replace
> file read LIST line
> }
> file close LIST
>
>
> (In fact, I run a loop like this twice (first to import csv into dta;
> another to work (clean) the dta files). As it stands now, my code
> takes around 12 hours to run. My question is: will Stata/MP make it
> run faster? (For those familiar with Matlab, I guess this boils down
> to: does Stata/MP have something along the lines of "parfor", i.e., a
> "parallel-for" command?)

I suspect the biggest overhead is the I/O (read/write) to the hard-drive
(even more so if you are working from a network drive), and as such
Stata/MP is unlikely to provide any major benefit on that front.

> More broadly, can anyone think of a way of speeding this up?

1) Why loop over twice, why can't you do the cleaning after reading the
   file in, but before saving the file?

2) Do the files have the same structure? If so you could use some simple
   concatenation ('cat') of files followed by 'grep -v' to exclude the
   header lines. This is done with command line tools that you could
   call from within Stata using -!- under GNU/Linux or OSX. If you're on
   M$-Windows then you can get the same functionality by installing the
   Cygwin shell (http://www.cygwin.com/).
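
A minimal sketch of the concatenation idea in point 2, assuming every CSV file shares an identical first (header) line and ends with a newline. The function name combine_csv and the paths csv_dir and combined.csv are hypothetical, chosen only for illustration. Files are fed to 'cat' through 'find' rather than a shell wildcard, since hundreds of thousands of names may exceed the command-line length limit.

```shell
#!/bin/sh
# combine_csv SRC_DIR OUT_FILE   (hypothetical helper, not from the original post)
# Stacks every *.csv directly inside SRC_DIR into OUT_FILE, keeping one header.
# Assumes: all files share the same header line and end with a newline.
# Caveat: a data row identical to the header would also be dropped.
combine_csv() {
    src_dir=$1
    out_file=$2

    # Take the header from the first file in sorted order.
    first=$(find "$src_dir" -maxdepth 1 -type f -name '*.csv' | sort | head -n 1)
    [ -n "$first" ] || return 1
    header=$(head -n 1 "$first")

    # Write the header once.
    printf '%s\n' "$header" > "$out_file"

    # Concatenate all files, then drop every line that exactly equals the
    # header (-x whole line, -F fixed string, -v invert match).
    # File order is whatever 'find' returns, so rows are not sorted by file.
    find "$src_dir" -maxdepth 1 -type f -name '*.csv' -exec cat {} + |
        grep -v -x -F -e "$header" >> "$out_file"
}

# Example call (hypothetical paths):
#   combine_csv csv_dir combined.csv
```

The output file should live outside the source directory (or not end in .csv), otherwise it would be picked up as input. The combined file could then be read into Stata once, for example with -insheet-, instead of importing each file in a loop; whether that fits depends on needing one stacked data set rather than one .dta per file.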
Neil

--
“Truth in science can be defined as the working hypothesis best suited
to open the way to the next better one.” - Konrad Lorenz

Email - nshephard@gmail.com
Website - http://kimura.no-ip.org/
Photos - http://www.flickr.com/photos/slackline/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/