



Re: st: looping over files -- speed and Stata/MP


From   Neil Shephard <[email protected]>
To   [email protected]
Subject   Re: st: looping over files -- speed and Stata/MP
Date   Wed, 16 Mar 2011 14:56:30 +0000

On Wed, Mar 16, 2011 at 2:48 PM, Dimitri Szerman <[email protected]> wrote:
> Hello,
>
> In constructing a data set, I have to loop over hundreds of thousands
> of files. Simply put, this is what I do:
>
> ! dir "mydir" /a-d /b > filelist.txt         // list of files to be imported
> file open LIST using "filelist.txt", read
> file read LIST line
> while r(eof)==0 {
>
>     (a bunch of Stata commands)
>
> save mydir2\\`line', replace
> file read LIST line
> }
> file close LIST
>
>
> In fact, I run a loop like this twice: first to import csv into dta,
> then to clean the dta files. As it stands now, my code
> takes around 12 hours to run. My question is: will Stata/MP make it
> run faster? (For those familiar with Matlab, I guess this boils down
> to: does Stata/MP have something along the lines of "parfor", i.e., a
> "parallel-for" command?)

I suspect the biggest overhead is the I/O (read/write) to the
hard drive (even more so if you are working from a network drive), and
as such Stata/MP is unlikely to provide any major benefit on that
front.

> More broadly, can anyone think of a way of speeding this up?

1) Why loop over the files twice?  Can't you do the cleaning after
reading each file in, but before saving it?

2) Do the files have the same structure?  If so you could use some
simple concatenation ('cat') of files followed by 'grep -v' to exclude
the header lines.  This is done with command line tools that you could
call from within Stata using -!- under GNU/Linux or OS X.  If you're on
Windows then you can get the same functionality by installing the
Cygwin shell (http://www.cygwin.com/).
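
A minimal sketch of point 2, under stated assumptions: every CSV file
sits directly in one directory, all files share an identical first
(header) line, each file ends with a newline, and the output file is
written outside the input directory.  The function name concat_csv and
the file names in the usage note are hypothetical, chosen only for
illustration.

```shell
# concat_csv INDIR OUT
# Merge every *.csv in INDIR into OUT, keeping a single header line.
# Assumes identical headers across files; OUT must not live in INDIR.
concat_csv() {
    indir=$1
    out=$2

    # Take the header from any one file (they are assumed identical).
    first=$(find "$indir" -maxdepth 1 -type f -name '*.csv' | head -n 1)
    if [ -z "$first" ]; then
        echo "no csv files found in $indir" >&2
        return 1
    fi
    header=$(head -n 1 "$first")

    # Write the header once.
    printf '%s\n' "$header" > "$out"

    # 'find ... -exec cat {} +' passes the file names in batches, so a
    # very large number of files does not overflow the command line the
    # way 'cat *.csv' can.  'grep -v -x -F' drops every line that is
    # exactly equal to the header.  File order is whatever find returns.
    find "$indir" -maxdepth 1 -type f -name '*.csv' -exec cat {} + |
        grep -v -x -F -e "$header" >> "$out" || true
}
```

With the function saved in a script, a call such as
concat_csv mydir combined.csv could be issued from Stata through -!-,
leaving one large file to import instead of hundreds of thousands of
small ones.  Note that any data line identical to the header would also
be dropped.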

Neil


-- 
“Truth in science can be defined as the working hypothesis best suited
to open the way to the next better one.” - Konrad Lorenz

Email - [email protected]
Website - http://kimura.no-ip.org/
Photos - http://www.flickr.com/photos/slackline/


