
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Using version control software with Stata


From   Phil Schumm <[email protected]>
To   Statalist Statalist <[email protected]>
Subject   Re: st: Using version control software with Stata
Date   Fri, 28 Mar 2014 10:49:16 -0500

On Mar 28, 2014, at 12:23 AM, Timothy Mak <[email protected]> wrote:
> I'm beginning to accumulate quite a lot of different do files and ado for the projects I'm working on, and am thinking of using a version control software to help me keep records of my programs. However, I don't have a computer programming background, and therefore no experience of using these things, and am not even sure it's something that's useful for do and ado files. However, if anyone on the list has experience to share in this regards, I'd very much love to hear. 


Version control systems are definitely useful when working on analytic projects (e.g., for storing do-files, bespoke commands/packages, copies of 3rd-party dependencies, manuscript drafts, etc.).  And they've evolved considerably over the past 5-10 years, so you definitely don't want to try to reinvent them or their functionality.

I don't have much time right now, but here are a few quick pointers:

1) Version control systems (VCS) used to be centralized, but now the best systems are decentralized.  You should definitely use the latter (for a million reasons).  The two best free systems are Git (most widely used) and Mercurial.  Git is by far the more capable of the two, but it can also be pretty complex (though you can ignore most of this complexity if you don't need it).  Mercurial has much of Git's functionality, but is perhaps a bit easier to use, at least at first.  One important advantage of Git for this type of work is that many data scientists use it (e.g., it's been incorporated into RStudio).

2) Spend a little time (e.g., 2-3 hours) with a good intro text on whichever system you choose.  There are many books, tutorials, etc., and most of these are freely available online.

3) Download and use SourceTree (http://www.sourcetreeapp.com).  SourceTree is a free, cross-platform GUI for Git and Mercurial.  It makes learning and using either of these systems much easier.  Moreover, it is the only tool of which I'm aware that attempts to abstract out the primary functionality from both Git and Mercurial and thereby provide a unified interface.  Thus, it makes it easy to switch back and forth between the two.  There are many other Git and Mercurial GUIs out there, but SourceTree is definitely one of the best.

4) Become familiar with both GitHub and Bitbucket.  GitHub is ubiquitous, of course, but the downside is that it does not permit you to create private repositories without a paid account (and it is exclusively Git-based).  In contrast, Bitbucket provides unlimited private Git and Mercurial repositories for people with .edu email addresses.  (Note that SourceTree interfaces nicely with both of these.)

5) Don't check data into a repository hosted online.  This would be a grievous error for primary data collected from human subjects for which you are responsible, and would likely violate many Data User Agreements (DUAs) for secondary data.  Instead, I would suggest storing the raw data for your project in a (secure) location separate from your repository, and then symlinking it into your project so that your do-files can refer to it by a path inside the project.
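
To make point 5 concrete, here is a minimal sketch in Python of one way to set up such a link.  It is only an illustration under stated assumptions: the function name link_raw_data, the link name "raw" and the directory layout are all hypothetical, and the .gitignore step is an extra safeguard for a Git repository rather than something described above (Mercurial uses .hgignore instead).  Note also that creating symlinks on Windows may require elevated privileges.

```python
from pathlib import Path


def link_raw_data(project_dir: Path, data_dir: Path, link_name: str = "raw") -> Path:
    """Symlink data_dir into project_dir and tell Git to ignore the link.

    All names here are hypothetical; adapt them to your own project layout.
    """
    link = project_dir / link_name
    if not link.is_symlink():
        # target_is_directory only has an effect on Windows.
        link.symlink_to(data_dir.resolve(), target_is_directory=True)

    # Git sees a symlink as a file, so the ignore pattern must not end in "/".
    # The leading "/" anchors the pattern to the top of the repository.
    pattern = f"/{link_name}"
    gitignore = project_dir / ".gitignore"
    text = gitignore.read_text() if gitignore.exists() else ""
    if pattern not in text.splitlines():
        if text and not text.endswith("\n"):
            text += "\n"
        gitignore.write_text(text + pattern + "\n")
    return link
```

After calling, say, link_raw_data(Path("myproject"), Path("/secure/study-data")) (both paths hypothetical), a do-file run from the project directory could read the data through a relative path such as raw/survey.dta, while the data themselves stay outside the repository.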

FWIW, I've had a lot of success teaching applied researchers (i.e., people who do quantitative data analysis but are not programmers) to use version control (using either Git or Mercurial), and the benefits in terms of reproducibility, efficiency and ease of collaboration are substantial.


-- Phil

