Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <n.j.cox@durham.ac.uk> |
To | "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu> |
Subject | RE: st: looping over files -- speed and Stata/MP |
Date | Wed, 16 Mar 2011 15:51:47 +0000 |
You could well be right. I guess I didn't quite believe that Dimitri was being literal.

Nick
n.j.cox@durham.ac.uk

Austin Nichols

I'm not sure "hundreds of thousands" of files is compatible with -fs- and macro storage limits.

Dimitri Szerman <dimitrijoe@gmail.com>:
Short answer: no, there is no large-grain parallelization in MP. Probably the biggest speed improvement would come from revisions in the part you have hidden as "(a bunch of Stata commands)" but you can get large-grain parallelization by writing out a bunch of do files and starting a new Stata instance to run each in batch mode. Something like this untested example:

! dir "mydir" /a-d /b > filelist.txt
file open LIST using "filelist.txt", read
loc i 1
file read LIST line
while r(eof)==0 {
  loc j=1000-1+`i'-mod(`i'-1,1000)
  file open p`j' using p`j'.do, write replace
  file write p`j' " (a bunch of Stata commands)"
  file write p`j' _n "save mydir2/`line', replace" _n "exit"
  file close p`j'
  file read LIST line
  if `i'==`j' winexec stata -b p`j'.do
}
file close LIST

On Wed, Mar 16, 2011 at 11:13 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> -fs- from SSC automates the production of a list of files. It is just
> a wrapper for a standard Stata extended macro function but it would
> obviate the need for a structure based on holding a file open while
> doing lots of other things. At the same time, it is difficult to know
> how much difference to timings that would make beyond reducing your
> use of the OS.
>
> Nick
>
> On Wed, Mar 16, 2011 at 2:48 PM, Dimitri Szerman <dimitrijoe@gmail.com> wrote:
>
>> In constructing a data set, I have to loop over hundreds of thousands
>> of files. Simply put, this is what I do:
>>
>> ! dir "mydir" /a-d /b > filelist.txt // list of files to be imported
>> file open LIST using "filelist.txt", read
>> file read LIST line
>> while r(eof)==0 {
>>
>> (a bunch of Stata commands)
>>
>> save mydir2\\`line', replace
>> file read LIST line
>> }
>> file close LIST
>>
>>
>> (In fact, I run a loop like this twice (first to import csv into dta;
>> another to work (clean) the dta files). As it stands now, my code
>> takes around 12 hours to run. My question is: will Stata/MP make it
>> run faster? (For those familiar with Matlab, I guess this boils down
>> to: does Stata/MP have something along the lines of "parfor", i.e., a
>> "parallel-for" command?) More broadly, can anyone think of a way of
>> speeding this up?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
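
The "write out a bunch of do files and start a new Stata instance for each" idea in this thread can also be driven from outside Stata. The following is only a rough sketch in Python, not anything posted by the participants. It assumes filelist.txt holds one file name per line, as produced by the dir command in the thread. The helper names (chunk, build_do_file, write_do_files, launch), the chunk size, the cap on simultaneous instances, and the batch command ("stata", "-b", "do") are all illustrative assumptions; the executable name and batch flag differ between installations and operating systems. The per-file work stays a placeholder, since the thread never shows it.

```python
import subprocess
from pathlib import Path


def chunk(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[k:k + size] for k in range(0, len(items), size)]


def build_do_file(names, out_dir="mydir2"):
    """Return the text of a do-file that handles every file in `names`."""
    lines = []
    for name in names:
        # Placeholder: the thread hides the per-file work behind this phrase.
        lines.append(f"* (a bunch of Stata commands) for {name}")
        lines.append(f'save "{out_dir}/{name}", replace')
    lines.append("exit")
    return "\n".join(lines) + "\n"


def write_do_files(names, size=1000, work_dir="."):
    """Write p1.do, p2.do, ... (one per chunk) and return their paths."""
    paths = []
    for n, part in enumerate(chunk(names, size), start=1):
        path = Path(work_dir) / f"p{n}.do"
        path.write_text(build_do_file(part))
        paths.append(path)
    return paths


def launch(paths, stata_cmd=("stata", "-b", "do"), max_parallel=4):
    """Run the do-files in groups of `max_parallel` Stata processes.

    `stata_cmd` is an assumption; adjust it to the local Stata executable.
    """
    for group in chunk(list(paths), max_parallel):
        procs = [subprocess.Popen([*stata_cmd, str(p)]) for p in group]
        for proc in procs:
            proc.wait()


# Possible usage (not run here):
# names = [n for n in Path("filelist.txt").read_text().splitlines() if n.strip()]
# launch(write_do_files(names))
```

Each do-file here covers a whole chunk of files, so the number of Stata instances is the number of chunks rather than the number of files. Keeping max_parallel near the number of available cores seems a sensible starting point, though the best value will depend on how much of the 12 hours is disk access rather than computation.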