


Re: st: -project- and big data files slowing down -build-

From   Robert Picard <>
Subject   Re: st: -project- and big data files slowing down -build-
Date   Tue, 23 Jul 2013 22:23:26 -0400

There are a lot of things to think about when dealing with big datasets.
-project- indeed uses -checksum- to check for changes in dependencies.
It tries to be smart about it by checking files in increasing order of
file size. You are correct, however, that if there are no changes in any
file, -project- will have to run a -checksum- on your large file to
confirm that. But I can assure you that if any of the do-files have
changed, your master do-file will start running before you can blink.
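
To illustrate why the no-change case is the slow one, here is a minimal
sketch of the smallest-first check described above. It is written in
Python rather than Stata and is not -project-'s actual code: the function
names are hypothetical, and SHA-256 merely stands in for whatever
-checksum- computes.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so a multi-GB file is never held in memory."""
    h = hashlib.sha256()  # assumption: stand-in for Stata's -checksum-
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def first_changed(recorded):
    """Return the first changed file, checking the smallest files first.

    `recorded` maps each path to the digest stored at the last build.
    Returns None when nothing changed, which is the slow case: by then
    every file, including the largest, has been hashed.
    """
    for path in sorted(recorded, key=os.path.getsize):
        if file_digest(path) != recorded[path]:
            return path  # stop early: larger files are never hashed
    return None
```

With this ordering, a changed do-file is detected after hashing only a
few kilobytes, while an unchanged tree forces a full pass over the
largest file.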

Personally, I would not tolerate such a slowdown either. The simplest
solution is to not declare a dependency for this large dataset. You
don't need an -ignore_chsum- option; just comment out the -project,
original()- statement. The downside is that -project- won't be able to
notice changes in the large dataset and react accordingly. It kind of
defeats the point of -project- but could perhaps be worth it in your
specific case.

If all you are doing is extracting some data from this large dataset
and working on a small subset, you could also split that preliminary
step into a separate project. You bite the bullet on a project that
manages the large file(s) and create smaller files that are processed
in separate projects. My biggest project handles 10GB of files, mostly
large raw text files that are input and converted to a smaller Stata
dataset (2GB).

If it is at all possible, you should consider investing in a faster
system. Your description matches the performance of a computer with a
regular hard disk. No matter how smartly you organize your work,
loading such a large dataset will take 20-plus seconds. A fast SSD
will greatly improve that load time. More RAM can also be helpful as
modern operating systems will cache I/O in RAM. It won't help for the
first load from disk but the second time will be much faster. On my
system, a dataset your size takes about 2 seconds to load, half a
second the second time around.
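
If you want to see the caching effect on your own machine, a rough
sketch like the following times two back-to-back reads of a file. It is
plain Python, not Stata, and the function names are hypothetical. It
measures raw file reads rather than a Stata -use-, and a file that was
written or read recently may already be cached, in which case both reads
will be fast.

```python
import time


def time_read(path, chunk_size=1 << 20):
    """Read the whole file once and return the elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass  # discard the data; only the I/O time matters here
    return time.perf_counter() - start


def first_and_second_read(path):
    """Time two consecutive reads of the same file, in seconds."""
    return time_read(path), time_read(path)
```

On a system with enough free RAM, the second number is usually the
smaller one because the operating system serves the file from its cache.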

On Tue, Jul 23, 2013 at 8:57 PM, Roberto Ferrer <> wrote:
> User-written package -project- by Robert Picard and installed using:
> net from
> What is the recommended course of action if I have a big data file
> that seems to slow down the project (re)build ?
> According to the help file:
> "The do(do_filename) build directive will not run do_filename if the
> do-file has not changed and all files linked to it have not changed
> since the last build."
> So I imagine there's a -checksum- slowing down the build even if no
> files change. I'm thinking of some option that would tell the build
> process to ignore this specific file. This file is the first input in
> the whole sequence (an -original-) and I'm sure it cannot change since
> it is in a write-protected directory.
> I suppose I can take this step out of the build and modify the
> corresponding files. At the end of the project, I could stick it back
> in. But a build directive like
>         project, original(dta_filename) ignore_chsum
> would be nice.
> The data file is 1.4GB in size and a build with no changes is taking
> around 30 seconds. I did an isolated -checksum- on the file and it's
> over 24 seconds. Other than that one I have few (38 linked) and small
> (<2mb) files.
> Thanks,
> Roberto
