



Re: st: options for syntax checking do files before running


From   Nick Cox <njcoxstata@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: options for syntax checking do files before running
Date   Wed, 10 Apr 2013 11:16:07 +0100

You are presuming world-wide familiarity with Ann Arbor. I thought she
was a 1940s starlet.

We can all hope for presents from generous but unpredictable uncles
and aunts, as StataCorp may be pictured here, but my
guess is that much less is possible than you imagine.

As you said yourself, do-files are interpreted, line by line, which is
crucial. That leaves _some_ scope for spotting errors ahead of
execution. For example, a program could read through a do-file and
check that parentheses, brackets and braces and single and double
quotation marks were paired. Indeed I sometimes run -hexdump- on a
long program file to look for problems of that kind. Arguably, this is
what syntax highlighting in a good text editor should help with. (Some
people seem to value special colo[u]rs for command names, which seems
to me neither here nor there.)
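
A pairing check of that kind could be sketched outside Stata, for example
in Python. This is only an illustration under stated assumptions: the
function name check_pairing is made up here, double-quoted strings are
assumed to open and close on one line, and comments, compound double
quotes and the left and right single quotes around local macros are not
handled at all.

```python
# Hypothetical helper, not part of Stata: report unpaired delimiters in do-file text.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def check_pairing(text):
    """Return a list of (line_number, message) for unpaired delimiters."""
    problems = []
    stack = []  # (opener, line_number) for each delimiter still open
    for lineno, line in enumerate(text.splitlines(), start=1):
        in_string = False  # assumption: a "..." string never spans lines
        for ch in line:
            if ch == '"':
                in_string = not in_string
            elif in_string:
                continue  # delimiters inside a string are ignored
            elif ch in OPENERS:
                stack.append((ch, lineno))
            elif ch in PAIRS:
                if stack and stack[-1][0] == PAIRS[ch]:
                    stack.pop()
                else:
                    problems.append((lineno, f"unexpected {ch!r}"))
        if in_string:
            problems.append((lineno, "unpaired double quote"))
    # Anything still on the stack was opened but never closed.
    problems.extend((n, f"unclosed {c!r}") for c, n in stack)
    return problems
```

Braces are tracked across lines, so a -foreach- block closed several
lines later is not reported. A check like this says nothing about
whether any command is legal; it only catches the pairing slips
mentioned above.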

That said, when you talk about the "syntax checker" you talk about
code that you can't access, and neither can I. But we can understand
roughly what it does from how it behaves.

The key point is that it works line by line and indeed token by token.

The crunch is this: What is legal depends not only on the syntax
defined for particular commands but on the state of your dataset at
that instant. The common structure

cmdname <varlist>

will be legal if and only if what appears after the command name is
legal as a varlist. That depends totally on what variables you have in
memory at that point. Even an extended syntax checker could not be
expected to keep track of that ahead of time.

The token-by-token parsing often bites hard. In another forum a user
was bitten by this (I've edited slightly from the original):

* begin example

I want to export data from Stata into a csv-file:

 outsheet "$dirLink/analysis.csv",  replace comma

but I got the following error message:

 factor variables and time-series operators not allowed
 r(101);

* end example.

What's going wrong here? It takes a human who has learned the syntax
or who reads the help file to say "You omitted -using-". So why is
Stata doing what may look like a lousy job, particularly as no factor
variables or time series operators are being used?

It is doing a lousy job because it is focusing only on the first token
it doesn't understand, which is

"$dirLink/analysis.csv"

Stata can't figure that out. It's not -using- and it's not a varlist,
which are both legal at this point. The presence of the period causes
a diagnosis that the user is trying a time series operator here, which
would be illegal. That diagnosis is wrong: it takes a human to spot
immediately that the next token is intended as a filename. (I'm
presuming here that the global has been expanded and its contents are
unproblematic.) Given that, it's immediate that -using- has been
omitted.

It's really hard for a program to spot that kind of thing. Naturally,
Stata's code could be bloated by adding all sorts of ad hoc extras.
Every time there is a common problem, you could put in a trap to catch
it and to explain it nicely. That's what user-programmers can try to
do for their own microcosms, but it's limitless.
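
As a sketch of what one such ad hoc trap might look like in a user's own
pre-check script (Python here; the names USING_COMMANDS and
missing_using_hint are invented for this illustration, and the list of
commands is an assumption, not anything taken from Stata):

```python
import re

# Commands assumed, for this sketch only, to take either a varlist or -using-.
# Command abbreviations are not handled.
USING_COMMANDS = {"outsheet"}


def missing_using_hint(line):
    """Return a hint if a quoted string directly follows the command name, else None."""
    match = re.match(r'\s*(\w+)\s+"', line)
    if match and match.group(1) in USING_COMMANDS:
        name = match.group(1)
        return f"-{name}- is followed by a quoted string; did you omit -using-?"
    return None
```

Applied to the -outsheet- line above, this returns the hint; applied to
the corrected line with -using-, it returns None. It also shows the
limitless part: each new command and each new kind of slip needs its own
rule.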

Let's take another example, your original.

replace myvar = "something" if missing(myvar) | if regexm(myvar, "myregex")

It takes a human knowing Stata to see that the second "if" is wrong.
But it's not inevitably an error to repeat -if- in a command. I can go

scatter mpg weight if foreign == 0 || scatter mpg weight if foreign == 1

and that's fine. So, it would be hard work for Stata to spot that kind
of error.
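
To make the difficulty concrete, here is a rough Python sketch of two
candidate rules. Both are hypothetical and deliberately crude: the
function names are invented, and splitting on || is a simplification
that ignores, for instance, || inside a quoted string.

```python
import re


def count_ifs(segment):
    """Count standalone -if- words outside double-quoted strings."""
    unquoted = re.sub(r'"[^"]*"', '""', segment)  # blank out string contents
    return len(re.findall(r"\bif\b", unquoted))


def naive_flag(line):
    """Rule 1: complain whenever -if- appears more than once on the line."""
    return count_ifs(line) > 1


def segment_flag(line):
    """Rule 2: complain only if one ||-separated segment repeats -if-."""
    return any(count_ifs(seg) > 1 for seg in line.split("||"))
```

Rule 1 flags the -replace- line but also flags the perfectly good
-scatter- line. Rule 2 separates these two examples, but only because it
has been taught about ||, which is one more special case among many.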

My crystal ball is just as cloudy as anyone else's, and I do hope and
expect that StataCorp will enhance the do-file editor to make it
easier to spot problems, but I'd be surprised at very much more.

Nick
njcoxstata@gmail.com


On 10 April 2013 03:48, Christopher Zbrozek <zbrozek@gmail.com> wrote:
> Thanks, Nick!
>
> I agree it's best to use Stata's syntax checker... and, cough, ahem,
> if anyone from Statacorp is listening, if users had the ability in a
> future release to quickly scan a dofile for syntax errors using the
> executable's syntax checker (without actually needing to run the code
> on a dataset), you'd have some happy users in Ann Arbor.
>
> Best,
> Christopher Zbrozek
> University of Michigan
>
> On Mon, Apr 8, 2013 at 3:55 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> I don't think there is any such program. It would be foolish of me to
>> rule out the possibility of some script that catches common errors,
>> but my experience is that it is best to let Stata itself find your
>> bugs and miscodings as fast as possible.
>>
>> Nick
>> njcoxstata@gmail.com
>>
>>
>> On 8 April 2013 20:28, Christopher Zbrozek <zbrozek@gmail.com> wrote:
>>> Hello world,
>>>
>>> Is anyone aware of user-written code that would review a do file to
>>> ensure commands are written using valid Stata syntax? Or,
>>> alternatively, is there a way to exploit the Stata executable's
>>> internal syntax checker for this purpose?
>>>
>>> The idea is that because Stata code is interpreted rather than
>>> compiled, clearly boneheaded syntax errors aren't caught until
>>> runtime. For example, when insufficiently caffeinated this morning, I
>>> tried to use a command along the lines of
>>>
>>> replace myvar = "something" if missing(myvar) | if regexm(myvar, "myregex")
>>>
>>> which, with that second "if" in there, works about as well as one
>>> thinks it should.
>>>
>>> One (quasi-)solution is of course to debug code using a small sample
>>> dataset before running it on millions of observations, which
>>> ameliorates but doesn't fix the problem. A rather laborious solution
>>> would be to write an ado file or Perl script or something to find
>>> common syntax errors and run that at the top of a do file on the text
>>> of the do file itself. An ideal solution would be to somehow employ
>>> the syntax checking Stata will perform anyway rather than trying to
>>> reverse-engineer that portion of the Stata executable.