
Re: st: insheet multi threading


From   Mike Lacy <Michael.Lacy@colostate.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: insheet multi threading
Date   Tue, 03 May 2011 12:17:07 -0600


On Mon, 2 May 2011 09:30:48 -0400, Argyn Kuketayev <akuketayev@mail.primaticsfinancial.com> wrote:

>I'm not talking about some obscure command either. It's a very basic
>task, and I'm sure everyone does it daily: reading CSV files. It takes
>over an hour on an 8-core machine to read a 13GB file, because CPU load
>is 12% the whole time; only one core is working.

>It's a junior-programmer-level assignment to parallelize the parsing
>part, which is why I'm surprised Stata didn't do it. It's frustrating
>because sometimes I get CSVs during the day and have to wait a long
>time before I can load them into Stata. Once in .dta format, everything
>is fast, reading and writing, so it's clearly the parsing that is slow.

Here's an inelegant approach that might nevertheless parallelize your job. Whether it works depends on something that seems to be true on my dual-processor machine running Windows XP and a single-processor version of Stata: a second instance of Stata, running concurrently, appears to be allocated by the operating system to a different processor than the one running the first instance. I base this claim on having run the same long job simultaneously in two instances of Stata and seen it finish in much less than twice the time of a single run.

If this is true, and perhaps generalizes to other operating systems and machines with more processors, you could:

1) Use -chunky- (-findit chunky-) to break your CSV file into multiple CSV files, each carrying the original header line of variable names.
2) Take the list of file names that -chunky- returns and break it into (say) 4 lists.
3) In the current instance of Stata, start -insheet-ing the files on the first list and saving them as *.dta files.
4) For each of the 3 remaining lists, start a new instance of Stata to -insheet- that list.
5) Append all the *.dta files together.
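Step 1 is what -chunky- handles inside Stata; outside Stata, the same header-preserving split could be sketched in shell. This is only an illustration under made-up file names and chunk sizes, not what -chunky- literally does:

```shell
# Split big.csv into chunk files, repeating the header line in each chunk.
# A sketch of step 1 above; file names and the 2-row chunk size are made up.
set -e

# Build a small example file standing in for the 13GB CSV.
printf 'id,x\n' > big.csv
for i in 1 2 3 4 5 6 7 8; do
    printf '%s,%s\n' "$i" "$((i * 10))" >> big.csv
done

header=$(head -n 1 big.csv)

# Write the data rows (everything after the header) into 2-row pieces.
tail -n +2 big.csv | split -l 2 - chunk_

# Prepend the header to every piece and give it a .csv extension.
for f in chunk_*; do
    { printf '%s\n' "$header"; cat "$f"; } > "$f.csv"
    rm "$f"
done

ls chunk_*.csv
```

Each resulting chunk_*.csv can then be -insheet-ed on its own, which is what makes the per-list division in steps 2-4 possible.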

This could be automated, I presume, by starting the other instances of Stata as batch jobs from your machine's command line; you could presumably even launch those other instances from within Stata. I freely admit this approach is clumsy and involves a fair amount of extra I/O, but it might be quite a bit faster if you are right that parsing is the rate-determining step of the job.
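The automation might be sketched as follows: generate one do-file per list of chunks, then launch each as a separate batch-mode Stata instance. The chunk and job file names here are hypothetical, and the launch command is only printed rather than executed, since the Stata executable's name, path, and batch flag vary by machine (on Unix the usual form is `stata -b do file.do`; on Windows it is the executable with the `/e` flag):

```shell
# Generate one do-file per worker, each -insheet-ing its chunk and saving
# a .dta file. Hypothetical names throughout; in practice the chunk names
# would come from the list that -chunky- returns.
set -e

for n in 1 2 3 4; do
    {
        printf 'insheet using "chunk_%s.csv", comma clear\n' "$n"
        printf 'save "chunk_%s.dta", replace\n' "$n"
    } > "job_$n.do"
    # Only print the batch invocation; actually running it requires a
    # local Stata installation and the correct executable path.
    echo "stata -b do job_$n.do"
done
```

Once all instances finish, a final do-file would -use- the first .dta file and -append using- the rest, which is step 5 above.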

I think, but am not sure, that this does not violate the terms of Stata's single-user licensing.

Regards,
=-=-=-=-=-=-=-=-=-=-=-=-=
Mike Lacy, Assoc. Prof.
Soc. Dept., Colo. State. Univ.
Fort Collins CO 80523 USA
(970)-491-6721
