Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: -project- and big data files slowing down -build-
Robert Picard <firstname.lastname@example.org>
Wed, 24 Jul 2013 11:53:27 -0400
Glad to hear that commenting out the dependency works for you. There
is indeed a cost in doing so: the file will not appear in the
various listings, and it will be swept away if you use the -cleanup-
option to remove files that are not part of the project.
With respect to the -checksum- calls, the first thing to understand is
that -checksum- is called only once per build per file. So if your big
file is used in 20 different do-files, the -checksum- is only computed
once. In terms of order of checking for changes, as I explained
earlier, it is done from the smallest to the largest file. This
process happens in the order do-files are executed (or skipped). The
master do-file inherits the dependencies of all do-files in the
project. Therefore any change to any file in the project will
immediately trigger its execution. When -project- encounters the next
-project, do()- statement, it checks the do-file's dependencies to
determine if it has to be run again. As with the master do-file, each
do-file inherits the dependencies of the do-files nested within it and,
as before, the check runs from the smallest file to the largest. Any checksum
already computed since the start of the build is reused. And so on...
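
To make the checking order above concrete, here is a minimal Python
sketch. It is not taken from the source of -project-; it only
illustrates the two ideas described: files are checked from smallest to
largest, and a checksum computed once during a build is reused. The
names file_checksum, first_changed, recorded (checksums saved by the
previous build) and cache (checksums computed so far in this build) are
hypothetical, and SHA-256 stands in for Stata's -checksum-.

```python
import hashlib
import os


def file_checksum(path, cache):
    """Return a digest for path, computing it at most once per build.

    cache is a dict shared by every check made during a single build.
    SHA-256 is an assumption; it is only a stand-in for -checksum-.
    """
    if path not in cache:
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # Read in 1 MB chunks so large files are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        cache[path] = digest.hexdigest()
    return cache[path]


def first_changed(paths, recorded, cache):
    """Check dependencies from smallest to largest file.

    Returns the first path whose checksum differs from the one recorded
    by the previous build, or None if nothing changed. Stopping at the
    first difference means a large file is read only when every smaller
    file is unchanged.
    """
    for path in sorted(paths, key=os.path.getsize):
        if file_checksum(path, cache) != recorded.get(path):
            return path
    return None
```

In this sketch, passing the same cache dictionary to every call made
during one build is what keeps a large file from being read more than
once, however many do-files depend on it.
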
So there is certainly overhead in using -project- that you would not
have otherwise. When working with large files, this overhead remains
low compared to all the other tasks (analysis and estimation) you are
likely to do. Some projects I work with take more than 24 hours to
perform a full replication build. They do include gigabyte-size files,
and yet a build takes me to whichever do-file I have just edited
in a few seconds at most.
On Wed, Jul 24, 2013 at 10:40 AM, Roberto Ferrer <email@example.com> wrote:
> Thank you for your reply, Robert.
> Let me see if I understand correctly:
> -project- checks if any do-file was modified. If it finds one, then it
> runs that do-file and that of any dependency. In this case, there is
> no need to -checksum- data files associated with the do-files. But if
> no do-file was modified, then it goes on with the -checksum- of every
> file in the project to verify this.
> I'm still confused on the order of the file checking. Does it check
> all do-files and then all data files? Does it check only considering
> size and independently of type? Or does it do something else?
> Probably the best way to figure all this stuff out is looking at
> the source file for -project- but it will probably take me some time
> to understand. I plan on doing so in the future.
> As for the -ignore_chsum- option I mentioned in my previous email, I
> still think it would be useful. It can be a bit dangerous but it gives
> the user more power. It gives the opportunity to include a file in the
> build process although the file is not expected to change (e.g.
> because it is in a write-protected directory). It would be good for
> documenting purposes since -project, list(build)- no longer mentions
> the file if we comment out the -project, original()- and it would keep
> everything in a unique project. I would just flag it somehow (a *
> maybe) in the listing output.
> As to my solution for now, I'll follow your advice and do exactly
> that: comment out -project, original()- for the big file. I'd rather
> do that than create a new project for a simple do-file.
> On Wed, Jul 24, 2013 at 3:23 AM, Robert Picard <firstname.lastname@example.org> wrote:
>> There are a lot of things to think about when dealing with big datasets.
>> -project- indeed uses -checksum- to check for changes in dependencies.
>> It tries to be smart about it by checking files in increasing order of
>> file size. You are correct however that if there are no changes in any
>> file, -project- will have to run a -checksum- on your large file to
>> confirm that. But I can assure you that if any of the do-files have
>> changed, your master do-file will start running before you can blink.
>> Personally, I would not tolerate such slowdown either. The simplest
>> solution is to not declare a dependency for this large dataset. You
>> don't need an -ignore_chsum- option; just comment out the -project,
>> original()- statement. The downside is that -project- won't be able to
>> notice changes in the large dataset and react accordingly. It kind of
>> defeats the point of -project- but could perhaps be worth it in your
>> specific case.
>> If all you are doing is extracting some data from this large dataset
>> and working on a small subset, you could also split that preliminary
>> step into a separate project. You bite the bullet on a project that
>> manages the large file(s) and create smaller files that are processed
>> in separate projects. My biggest project handles 10GB of files, mostly
>> large raw text files that are input and converted to a smaller Stata
>> dataset (2GB).
>> If it is at all possible, you should consider investing in a faster
>> system. Your description matches the performance of a computer with a
>> regular hard disk. No matter how smartly you organize your work,
>> loading such a large dataset will take 20 plus seconds. A fast SSD
>> will greatly improve that load time. More RAM can also be helpful as
>> modern operating systems will cache I/O in RAM. It won't help for the
>> first load from disk but the second time will be much faster. On my
>> system, a dataset your size takes about 2 seconds to load, half a
>> second the second time around.
>> On Tue, Jul 23, 2013 at 8:57 PM, Roberto Ferrer <email@example.com> wrote:
>>> User-written package -project- by Robert Picard and installed using:
>>> net from http://robertpicard.com/stata
>>> What is the recommended course of action if I have a big data file
>>> that seems to slow down the project (re)build ?
>>> According to the help file:
>>> "The do(do_filename) build directive will not run do_filename if the
>>> do-file has not changed and all files linked to it have not changed
>>> since the last build."
>>> So I imagine there's a -checksum- slowing down the build even if no
>>> files change. I'm thinking of some option that would tell the build
>>> process to ignore this specific file. This file is the first input in
>>> the whole sequence (an -original-) and I'm sure it cannot change since
>>> it is in a write-protected directory.
>>> I suppose I can take this step out of the build and modify the
>>> corresponding files. At the end of the project, I could stick it back
>>> in. But a build directive like
>>> project, original(dta_filename) ignore_chsum
>>> would be nice.
>>> The data file is 1.4GB in size and a build with no changes is taking
>>> around 30 seconds. I did an isolated -checksum- on the file and it's
>>> over 24 seconds. Other than that one I have few (38 linked) and small
>>> (<2MB) files.
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>> * http://www.ats.ucla.edu/stat/stata/