Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: -project- and big data files slowing down -build-
Roberto Ferrer <firstname.lastname@example.org>
Stata Help <email@example.com>
Mon, 29 Jul 2013 20:48:24 +0100
Thank you Robert.
It is much clearer now.
I will keep working with -project- and post any comments/doubts.
On Wed, Jul 24, 2013 at 4:53 PM, Robert Picard <firstname.lastname@example.org> wrote:
> Glad to hear that commenting out the dependency works for you. There
> is indeed a cost in doing so in that the file will not appear in the
> various listings and will be swept away if you use the -cleanup-
> option to remove the files that are not part of the project from the
> project directory.
> With respect to the -checksum- calls, the first thing to understand is
> that -checksum- is called only once per build per file. So if your big
> file is used in 20 different do-files, the -checksum- is only computed
> once. In terms of order of checking for changes, as I explained
> earlier, it is done from the smallest to the largest file. This
> process happens in the order do-files are executed (or skipped). The
> master do-file inherits the dependencies of all do-files in the
> project. Therefore any change to any file in the project will
> immediately trigger its execution. When -project- encounters the next
> -project, do()- statement, it checks the do-file's dependencies to
> determine if it has to be run again. As with the master do-file, any
> do-file inherits the dependencies of do-files nested within it and as
> before, the check is done from smallest to largest files. Any checksum
> already computed since the start of the build is reused. And so on...
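To make the inheritance described above concrete, here is a minimal sketch of a master do-file. The do-file names are hypothetical, and only the -project, do()- directive quoted in this thread is used; it illustrates the layout and is not taken from an actual project.

```stata
* master.do (hypothetical): the master do-file of the project.
* It inherits the dependencies of every do-file run below, so a change
* to any file linked to them causes master.do to be run again.

* Each directive checks the do-file's own dependencies, smallest file
* first, and skips the do-file when nothing linked to it has changed.
project, do("extract.do")
project, do("analysis.do")

* analysis.do (hypothetical) may nest further do-files, for example
*   project, do("tables.do")
* in which case analysis.do also inherits the dependencies of tables.do.
```

Because a checksum computed earlier in the build is reused, a file linked to both extract.do and analysis.do is checksummed only once per build.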
> So there is certainly overhead in using -project- that you would not
> have otherwise. When working with large files, this overhead remains
> low compared to all the other tasks (analysis and estimation) you are
> likely to do. Some projects I work with take more than 24 hours to
> perform a full replication build. They do include gigabyte-size files
> and yet performing a build on them can take me to any do-file I'm
> working on (just edited) in a few seconds at most.
> On Wed, Jul 24, 2013 at 10:40 AM, Roberto Ferrer <email@example.com> wrote:
>> Thank you for your reply, Robert.
>> Let me see if I understand correctly:
>> -project- checks if any do-file was modified. If it finds one, then it
>> runs that do-file and that of any dependency. In this case, there is
>> no need to -checksum- data files associated with the do-files. But if
>> no do-file was modified, then it goes on with the -checksum- of every
>> file in the project to verify this.
>> I'm still confused on the order of the file checking. Does it check
>> all do-files and then all data files? Does it check only considering
>> size and independently of type? Or does it do something else?
>> Probably the best way to figure all this stuff out is looking at
>> the source file for -project- but it will probably take me some time
>> to understand. I plan on doing so in the future.
>> As for the -ignore_chsum- option I mentioned in my previous email, I
>> still think it would be useful. It can be a bit dangerous but it gives
>> the user more power. It gives the opportunity to include a file in the
>> build process although the file is not expected to change (e.g.
>> because it is in a write-protected directory). It would be good for
>> documenting purposes since -project, list(build)- no longer mentions
>> the file if we comment out the -project, original()- and it would keep
>> everything in a unique project. I would just flag it somehow (a *
>> maybe) in the listing output.
>> As to my solution for now, I'll follow your advice and do exactly
>> that: comment out -project, original()- for the big file. I'd rather
>> do that than create a new project for a simple do-file.
>> On Wed, Jul 24, 2013 at 3:23 AM, Robert Picard <firstname.lastname@example.org> wrote:
>>> There are a lot of things to think about when dealing with big datasets.
>>> -project- indeed uses -checksum- to check for changes in dependencies.
>>> It tries to be smart about it by checking files in increasing order of
>>> file size. You are correct however that if there are no changes in any
>>> file, -project- will have to run a -checksum- on your large file to
>>> confirm that. But I can assure you that if any of the do-files have
>>> changed, your master do-file will start running before you can blink.
>>> Personally, I would not tolerate such slowdown either. The simplest
>>> solution is to not declare a dependency for this large dataset. You
>>> don't need a -ignore_chsum- option, just comment out the -project,
>>> original()- statement. The downside is that -project- won't be able to
>>> notice changes in the large dataset and react accordingly. It kind of
>>> defeats the point of -project- but could perhaps be worth it in your
>>> specific case.
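A minimal sketch of that workaround follows, assuming a do-file that extracts a small subset from the large dataset. All file names are hypothetical and the -keep- line is a placeholder for the real extraction step. The -project, creates()- directive is written from memory of the help file and should be checked against -help project-.

```stata
* extract.do (hypothetical file names throughout)

* The dependency on the large file is deliberately left undeclared, so
* -project- never runs -checksum- on it and cannot detect changes to it:
* project, original("big_raw.dta")

use "big_raw.dta", clear
keep in 1/1000                  // placeholder for the real extraction step
save "subset.dta", replace

* Assumed syntax: declare the small output so later do-files can link to it.
project, creates("subset.dta")
```

With the directive commented out, the large file no longer appears in listings such as -project, list(build)-.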
>>> If all you are doing is extracting some data from this large dataset
>>> and working on a small subset, you could also split that preliminary
>>> step into a separate project. You bite the bullet on a project that
>>> manages the large file(s) and create smaller files that are processed
>>> in separate projects. My biggest project handles 10GB of files, mostly
>>> large raw text files that are input and converted to a smaller Stata
>>> dataset (2GB).
>>> If it is at all possible, you should consider investing in a faster
>>> system. Your description matches the performance of a computer with a
>>> regular hard disk. No matter how smartly you try to organize your work,
>>> loading such a large dataset will take 20 plus seconds. A fast SSD
>>> will greatly improve that load time. More RAM can also be helpful as
>>> modern operating systems will cache I/O in RAM. It won't help for the
>>> first load from disk but the second time will be much faster. On my
>>> system, a dataset your size takes about 2 seconds to load, half a
>>> second the second time around.
>>> On Tue, Jul 23, 2013 at 8:57 PM, Roberto Ferrer <email@example.com> wrote:
>>>> I am using the user-written package -project- by Robert Picard,
>>>> installed using:
>>>> net from http://robertpicard.com/stata
>>>> What is the recommended course of action if I have a big data file
>>>> that seems to slow down the project (re)build?
>>>> According to the help file:
>>>> "The do(do_filename) build directive will not run do_filename if the
>>>> do-file has not changed and all files linked to it have not changed
>>>> since the last build."
>>>> So I imagine there's a -checksum- slowing down the build even if no
>>>> files change. I'm thinking of some option that would tell the build
>>>> process to ignore this specific file. This file is the first input in
>>>> the whole sequence (an -original-) and I'm sure it cannot change since
>>>> it is in a write-protected directory.
>>>> I suppose I can take this step out of the build and modify the
>>>> corresponding files. At the end of the project, I could stick it back
>>>> in. But a build directive like
>>>> project, original(dta_filename) ignore_chsum
>>>> would be nice.
>>>> The data file is 1.4GB in size and a build with no changes is taking
>>>> around 30 seconds. I did an isolated -checksum- on the file and it
>>>> took over 24 seconds. Other than that one, I have few (38 linked) and
>>>> small (<2 MB) files.
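For anyone who wants to repeat the isolated timing, a minimal sketch using Stata's built-in -checksum- and -timer- commands is shown here. The file name is a placeholder. With the figures above, 1.4GB in about 24 seconds works out to roughly 58 MB per second.

```stata
* Time a stand-alone -checksum- on the large file.
* "big_raw.dta" is a placeholder file name.
timer clear 1
timer on 1
checksum "big_raw.dta"

* Keep the -checksum- results now; -timer list- replaces r() later.
local size_mb = r(filelen) / 1e6
local chk     = r(checksum)
timer off 1

* -timer list- leaves the elapsed seconds in r(t1).
timer list 1
display "size: " %9.1f `size_mb' " MB, checksum: `chk'"
display "throughput: " %6.1f (`size_mb' / r(t1)) " MB per second"
```

Running it twice in a row gives a rough idea of how much operating-system caching helps on the second pass.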
>>>> * For searches and help try:
>>>> * http://www.stata.com/help.cgi?search
>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> * http://www.ats.ucla.edu/stat/stata/