Re: st: Using version control software with Stata

From	Stas Kolenikov <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: Using version control software with Stata
Date	Mon, 31 Mar 2014 12:32:57 -0500

On top of the comments Phil Schumm has already provided (which
are all spot on),

On Mon, Mar 31, 2014 at 3:51 AM, Timothy Mak <[email protected]> wrote:
>
> 1. It seems to me what a Version Control System (VCS) does is that it automatically keeps a record of all changes to my file, numbering them sequentially, perhaps every time I save it. This record would be kept in a separate file (or files) to the current (most recent) do file.

Revision control systems usually take snapshots when you ask them to,
not on every save. Usually that is after you have incorporated a
certain non-negligible change, and possibly re-run the part of the
project that is affected by that change.

I imagine that some revision control systems can keep track of
dependencies for the major programming languages, but I doubt any of
them would support anything as exotic as Stata. This is especially
difficult given that you may pass the name of program1 to program2 as
the thing to run, and even though there may have been no change in
program2, the revision control system will have a very hard time
figuring out from the contents of program2 that it depends on
program1. Think about prefixes like -bootstrap:- or -svy:- that are
agnostic to what follows by design; you may have something similar in
your projects, and no text analysis would be able to uncover such
script-based dependencies. You have to declare them explicitly.

Partially, for data file dependencies, such capacity is provided by
Robert Picard's -project-. At the top of a do-file, you can declare
all the dependencies that you, as the developer, have in mind,
including the data that the do-file -use-s, and the ado-files that it
calls. At the bottom of the do-file, you can declare all the new files
that the do-file creates, including .dta files, .ster files, whatever
you have produced with -estout- or -outreg-, etc. Then -project-
creates an internal table of cross-dependencies, and when you -build-
the project, only runs the paths that need to be run based on what you
updated in your code and/or data. However, -project- loathes passing
anything to do-files as parameters. So -project- is strongly oriented
toward complex data management and/or analysis tasks, and is not
suitable for pure ado-code development (but the standard revision
control systems like Git or Mercurial take wonderful care of most of
those tasks).
The learning curve is somewhat steep, but it is worth figuring out,
just to stop worrying about forgetting to run something at a far
corner of the project to get the things aligned.
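
To make the declarations concrete, here is a rough sketch of what one
step of a -project- build might look like. The file and variable names
are made up for illustration, and the option names (original(), uses(),
creates()) are as I recall them from the package's help file, so check
-help project- before relying on them.

```stata
* clean_survey.do: a hypothetical step in a project managed by -project-
* (all file and variable names below are invented for illustration)

* dependencies declared at the top
project, original("raw/survey_raw.dta")   // input not produced by the project
project, uses("data/frame.dta")           // input produced by an earlier step

use "raw/survey_raw.dta", clear
merge m:1 psu_id using "data/frame.dta", nogenerate
save "data/survey_clean.dta", replace

* products declared at the bottom
project, creates("data/survey_clean.dta")
```

The master do-file would then list the steps with -project, do()-, and
building the project should re-run this step only when one of the
declared inputs or the do-file itself has changed.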

> 2. However, say I want to send the most recent do-file to someone, but I want to tag it as version 0.1.123, e.g. as part of the preamble in the do-file. I guess I won't be able to do this automatically if I use Stata's do-file editor, right? I'll still have to manually type in "* Version 0.1.123" in the first line. Or perhaps the VCS can automatically add this line to the top of the file every time it is saved, even if I just press Ctrl-S in the do-file editor?

See Phil's comment. If you are working in a team on a project, the
best way is to set up a repository from which everybody would pull the
common do-files. Doing so would force everybody to work with a coherent
version of the project, at least to the extent that the developers
keep it coherent. Re-running everything top to bottom with the
specifically designed test cases is usually the best way to make sure
that the system still works as intended. Updating these test cases as
the new functionality is added is crucial for this system to do what
it's supposed to be doing (most test cases are actually intended to
produce, and catch, an error, such as poor data, incompatible options,
misspelled variable names by the user, etc.). So if everybody on the
project has the habit of pulling the latest set of updates from the
repository when they come to work on the project, there's an assurance
that they will be working with the right files. You can configure
repositories to allow the developers to make changes, and the analysts
only to pull the modified files. Emailing a separate file, however,
always runs the risk of conflicts.

I try to align my version numbers with the revision numbers in my VCS,
at least for the main files of the project. This means that sometimes
I would skip from *! version 1.13 to *! version 1.35 if I did not need
to modify a certain file for the intermediate 22 changes that
concerned other parts of the project.
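
As a small illustration of that convention, the first line of an
ado-file could carry the revision number in a -*!- comment. The program
name -myado- and the revision number here are hypothetical.

```stata
*! version 1.35  31mar2014  (hypothetical: tracks VCS revision 35)
program define myado
    version 13
    display as text "myado is running"
end
```

Once this is saved as myado.ado somewhere on the ado-path, -which myado-
reports the file location along with the -*!- line, just like it does
for the official commands.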

> 3. Another thing is when one do-file references another. Currently my practice is to have a "V#" added to the end of the name of nearly all my do-files, e.g. "mydofileV1.do". Thus all references to do-files have to indicate the version number also. However, if I do use a VCS, where the previous versions are not stored as separate files, presumably I'll have to adopt a different system. And because the do-file editor is not really a developer's platform, I struggle to see how this can be easily done.

If you want to stick to this convention, you can pass the version as a
parameter (although you probably want to name that option something
other than -version-, which is already a Stata command; maybe
-birthmark()- or something). Revision control systems attach a unique
tag to each version you commit (usually some sort of a hash), and
you can use that tag to verify which version of the software does
what.
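
A minimal sketch of what I have in mind, with a hypothetical program
-myado- and a made-up option name -birthmark()-; the tag itself would
be whatever your revision control system reports for the commit.

```stata
* myado.ado: hypothetical program that accepts and reports a revision tag
program define myado, rclass
    version 13
    syntax [, BIRTHmark(string)]

    * fall back to a placeholder when no tag is supplied
    if "`birthmark'" == "" local birthmark "untagged"

    display as text "myado, birthmark: " as result "`birthmark'"

    * leave the tag behind so that the calling do-file can log or check it
    return local birthmark "`birthmark'"
end
```

A calling do-file could then say -myado, birthmark("a1b2c3d")- and
compare r(birthmark) against the tag it expects.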

> 4. With ado-files, I cannot even implement my system above, since if my program is -myado-, I must name it "myado.ado", and not "myadoV1.ado". Therefore, at the moment, I often write my ado-files as do files, and -run "myadoV1.do"-, before using -myado ...-.

That's awkward.

> 5. At the same time, I wonder how professional programmers cope with this problem. It is surely impractical to have names of programs/functions/modules that change with every version. Presumably, if it is a compiled language, then whenever the program is compiled, under a VCS, the version number of each of the program/function/module that is called can also be recorded. But what about scripting languages?

See Phil's answer :)

> 6. Ideally, I'll have a system where every time I save a do-file, it will also save in its preamble the version number of all the do-files and ado-files that it refers to. Even better is that it will automatically find the appropriate version of that program/function every time I run the do file also. I know this is probably well beyond what can be practically achieved.

I am pretty sure this is unnecessarily complicated, and a proper
revision control/project management system will keep a better record
internally than you could mentally: you don't want to see a preamble
that lists more than three files, and in realistic projects, you will
end up with 20+ if not 50+ dependencies by the time you get to your
last "produce_the_pdf_report.do" file.

I don't see any problems with different files having different
versions, and/or different generations of versions:

. version
version 13.1

. which regress
C:\Stata13\ado\base\r\regress.ado
*! version 1.3.0 14apr2011

. which bootstrap
C:\Stata13\ado\base\b\bootstrap.ado
*! version 4.7.0 12apr2013

. which logit
C:\Stata13\ado\base\l\logit.ado
*! version 11.1.0 16jun2011

. which svyset
C:\Stata13\ado\base\s\svyset.ado
*! version 3.3.1 20apr2010

Somehow, Stata Corp has been managing to have version 13.1 of the
whole product along with version 1.3.0 of -regress-. Has -logit.ado-
been reviewed more times than -regress-? This is an immaterial
question as long as they all function as documented. I know from my
own Stata coding experience that the three major first-digit versions
of -svyset- save the settings in different enough ways that trying to
write my own code to hack into them has been quite disappointing.
Instead, I learned to rely on the returned values of -svyset- to pick
up the names of the weights, strata and PSU variables when I need them
in my custom code for survey estimation that I do on my own. That's
how good programming is done: I do not need to know the version of
-svyset- and how it works internally, I only need to know the hooks
that it provides externally. If I write the code that way, then when
something changes in how -svyset- internally stores the meta-data
about the sampling design, my code does not break. This, of course,
only works insofar as the new revisions of -svyset- continue providing
these hooks, which is the responsibility of Stata Corp. to ensure; I
honestly expect that they will do so for backwards compatibility
reasons, as any new versions of -svyset- would have to be able to deal
with their own older estimation commands that have relied on the
returned values from the current version.
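
For what it is worth, here is roughly how that looks in practice. This
is only a sketch: it uses the nhanes2 example data and sets the design
explicitly, and the local macro names are my own.

```stata
* Pick up the design variables from the results returned by -svyset-
* rather than from wherever -svyset- stores its settings internally
webuse nhanes2, clear
svyset psuid [pweight=finalwgt], strata(stratid)

quietly svyset                  // replay the settings; fills in r()
local wtype  "`r(wtype)'"       // weight type
local wexp   "`r(wexp)'"        // weight expression, including the "="
local psu    "`r(su1)'"         // first-stage sampling units
local strata "`r(strata1)'"     // first-stage strata

display as text "weights: [`wtype'`wexp']"
display as text "PSU:     `psu'"
display as text "strata:  `strata'"
```

The custom code only ever touches r(wtype), r(wexp), r(su1) and
r(strata1), so it does not care which version of -svyset- produced
them.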

-- Stas Kolenikov, PhD, PStat (ASA, SSC)
-- Principal Survey Scientist, Abt SRBI
-- Opinions stated in this email are mine only, and do not reflect the
position of my employer
-- http://stas.kolenikov.name