
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: -project- and big data files slowing down -build-


From   Roberto Ferrer <refp16@gmail.com>
To   Stata Help <statalist@hsphsun2.harvard.edu>
Subject   Re: st: -project- and big data files slowing down -build-
Date   Wed, 24 Jul 2013 15:40:30 +0100

Thank you for your reply, Robert.

Let me see if I understand correctly:
-project- checks whether any do-file was modified. If it finds one, it
runs that do-file and those of any dependencies. In this case, there is
no need to -checksum- the data files associated with the do-files. But if
no do-file was modified, it goes on to -checksum- every
file in the project to confirm that nothing changed.

I'm still confused about the order of the file checking. Does it check
all do-files and then all data files? Does it check by size only,
regardless of file type? Or does it do something else?
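
Going only by Robert's description quoted below (files are checked in
increasing order of file size, and the build starts as soon as a change
is found), the check might look roughly like the following. This is a
minimal Python sketch, not the actual -project- source: the function
names, the recorded (size, digest) layout, and the use of SHA-256 in
place of Stata's -checksum- are all assumptions made for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks; SHA-256 is a stand-in for Stata's -checksum-."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def first_changed(recorded):
    """Return the first changed file, checking the smallest files first.

    `recorded` (hypothetical layout) maps path -> (size, digest) as saved
    at the last build. Returns None when nothing changed, which is the
    slow case: every file, including the largest, has to be hashed.
    """
    for path in sorted(recorded, key=lambda p: recorded[p][0]):
        size, digest = recorded[path]
        if not os.path.exists(path):
            return path  # a missing file counts as a change
        if os.path.getsize(path) != size:
            return path  # cheap test, no hashing needed
        if file_digest(path) != digest:
            return path  # same size, different content
    return None
```

Under these assumptions, a modified do-file (small) is detected almost
immediately, while an unchanged project pays for hashing the 1.4GB file
on every build.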

Probably the best way to figure all this out is to look at
the source file for -project-, but it will probably take me some time
to understand. I plan on doing so in the future.

As for the -ignore_chsum- option I mentioned in my previous email, I
still think it would be useful. It can be a bit dangerous, but it gives
the user more power. It makes it possible to include a file in the
build process even though the file is not expected to change (e.g.
because it is in a write-protected directory). It would be good for
documentation purposes, since -project, list(build)- no longer mentions
the file if we comment out the -project, original()- statement, and it
would keep everything in a single project. I would just flag the file
somehow (with a *, maybe) in the listing output.
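
To make the idea concrete, here is a minimal Python sketch of how such
a flag could behave. It is purely hypothetical: -project- has no
-ignore_chsum- option, and the Dependency class, its skip_checksum
field, and the file names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    path: str
    size: int
    skip_checksum: bool = False  # hypothetical analogue of -ignore_chsum-


def files_to_checksum(deps):
    """Paths that still need a checksum, smallest first."""
    ordered = sorted(deps, key=lambda d: d.size)
    return [d.path for d in ordered if not d.skip_checksum]


def build_listing(deps):
    """One line per dependency; skipped files stay listed, marked with '*'."""
    return [("* " if d.skip_checksum else "  ") + d.path for d in deps]


deps = [
    Dependency("big.dta", 1_400_000_000, skip_checksum=True),
    Dependency("clean.do", 2_000),
]
```

With these two entries, only clean.do would be checksummed, while
big.dta would still appear in the listing with a leading *.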

As to my solution for now, I'll follow your advice and do exactly
that: comment out -project, original()- for the big file. I'd rather
do that than create a new project for a simple do-file.

Thanks,
Roberto

On Wed, Jul 24, 2013 at 3:23 AM, Robert Picard <picard@netbox.com> wrote:
> There are a lot of things to think about when dealing with big datasets.
> -project- indeed uses -checksum- to check for changes in dependencies.
> It tries to be smart about it by checking files in increasing order of
> file size. You are correct however that if there are no changes in any
> file, -project- will have to run a -checksum- on your large file to
> confirm that. But I can assure you that if any of the do-files have
> changed, your master do-file will start running before you can blink.
>
> Personally, I would not tolerate such a slowdown either. The simplest
> solution is to not declare a dependency for this large dataset. You
> don't need an -ignore_chsum- option, just comment out the -project,
> original()- statement. The downside is that -project- won't be able to
> notice changes in the large dataset and react accordingly. It kind of
> defeats the point of -project- but could perhaps be worth it in your
> specific case.
>
> If all you are doing is extracting some data from this large dataset
> and working on a small subset, you could also split that preliminary
> step into a separate project. You bite the bullet on a project that
> manages the large file(s) and create smaller files that are processed
> in separate projects. My biggest project handles 10GB of files, mostly
> large raw text files that are input and converted to a smaller Stata
> dataset (2GB).
>
> If it is at all possible, you should consider investing in a faster
> system. Your description matches the performance of a computer with a
> regular hard disk. No matter how smartly you organize your work,
> loading such a large dataset will take 20 plus seconds. A fast SSD
> will greatly improve that load time. More RAM can also be helpful as
> modern operating systems will cache I/O in RAM. It won't help for the
> first load from disk but the second time will be much faster. On my
> system, a dataset your size takes about 2 seconds to load, half a
> second the second time around.
>
>
> On Tue, Jul 23, 2013 at 8:57 PM, Roberto Ferrer <refp16@gmail.com> wrote:
>> User-written package -project- by Robert Picard and installed using:
>> net from http://robertpicard.com/stata
>>
>> What is the recommended course of action if I have a big data file
>> that seems to slow down the project (re)build?
>>
>> According to the help file:
>>
>> "The do(do_filename) build directive will not run do_filename if the
>> do-file has not changed and all files linked to it have not changed
>> since the last build."
>>
>> So I imagine there's a -checksum- slowing down the build even if no
>> files change. I'm thinking of some option that would tell the build
>> process to ignore this specific file. This file is the first input in
>> the whole sequence (an -original-) and I'm sure it cannot change since
>> it is in a write-protected directory.
>>
>> I suppose I can take this step out of the build and modify the
>> corresponding files. At the end of the project, I could stick it back
>> in. But a build directive like
>>
>>         project, original(dta_filename) ignore_chsum
>>
>> would be nice.
>>
>> The data file is 1.4GB in size and a build with no changes is taking
>> around 30 seconds. I did an isolated -checksum- on the file and it took
>> over 24 seconds. Other than that one, I have few (38 linked) and small
>> (<2MB) files.
>>
>> Thanks,
>> Roberto
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/

