Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: options for syntax checking do files before running

From	Nick Cox <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: options for syntax checking do files before running
Date	Wed, 10 Apr 2013 11:16:07 +0100

You are presuming world-wide familiarity with Ann Arbor. I thought she
was a 1940s starlet.

We can all hope for presents from generous but unpredictable uncles
and aunts, as StataCorp may be pictured here, but my guess is that
much less is possible than you imagine.

As you said yourself, do-files are interpreted, line by line, which is
crucial. That leaves _some_ scope for spotting errors ahead of
execution. For example, a program could read through a do-file and
check that parentheses, brackets and braces and single and double
quotation marks were paired. Indeed I sometimes run -hexdump- on a
long program file to look for problems of that kind. Arguably, this is
what syntax highlighting in a good text editor should help with. (Some
people seem to value special colo[u]rs for command names, which seems
to me neither here nor there.)
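
A minimal sketch of the kind of pre-check described above, written in
Python rather than Stata so that it runs outside Stata entirely. The
function name check_delimiters is hypothetical. It assumes that a double
quote closes on the line where it opens, and it ignores comments, ///
line continuation, and the left and right single quotes used for local
macros:

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def check_delimiters(text):
    """Return a list of (line_number, message) for unpaired delimiters."""
    problems = []
    stack = []  # (opener, line_number) for brackets that are still open
    for lineno, line in enumerate(text.splitlines(), start=1):
        in_string = False
        for ch in line:
            if ch == '"':
                in_string = not in_string  # toggle on every double quote
            elif in_string:
                continue  # brackets inside a quoted string are not syntax
            elif ch in OPENERS:
                stack.append((ch, lineno))
            elif ch in PAIRS:
                if stack and stack[-1][0] == PAIRS[ch]:
                    stack.pop()
                else:
                    problems.append((lineno, f"unexpected {ch!r}"))
        if in_string:
            problems.append((lineno, "unpaired double quote"))
    # Anything left on the stack was opened but never closed.
    for opener, lineno in stack:
        problems.append((lineno, f"unclosed {opener!r}"))
    return problems
```

Because the bracket stack persists across lines, a brace opened by
-foreach- and closed several lines later counts as paired. Note what
such a check cannot do: a line like

replace myvar = "something" if missing(myvar) | if regexm(myvar, "myregex")

passes cleanly, because every delimiter in it is paired.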

That said, when you talk about the "syntax checker" you talk about
code that you can't access, and neither can I. But we can understand
roughly what it does from how it behaves.

The key point is that it works line by line and indeed token by token.

The crunch is this: What is legal depends not only on the syntax
defined for particular commands but on the state of your dataset at
that instant. The common structure

cmdname <varlist>

will be legal if and only if what appears after the command name is
legal as a varlist. That depends totally on what variables you have in
memory at that point. Even an extended syntax checker could not be
expected to keep track of that ahead of time.

The token-by-token parsing often bites hard. In another forum a user
was bitten by this (I've edited slightly from the original):

* begin example

I want to export data from Stata into a csv-file:

 outsheet "$dirLink/analysis.csv",  replace comma

but I got the following error message:

 factor variables and time-series operators not allowed
 r(101);

* end example.

What's going wrong here? It takes a human who has learned the syntax
or who reads the help file to say "You omitted -using-". So why is
Stata doing what may look like a lousy job, particularly as no factor
variables or time series operators are being used?

It is doing a lousy job because it is focusing only on the first token
it doesn't understand, which is

"$dirLink/analysis.csv"

Stata can't figure that out. It's neither -using- nor a varlist, either
of which would be legal at this point. The presence of the period causes
a diagnosis that the user is trying a time series operator here, which
would be illegal. That diagnosis is wrong: it takes a human to spot
immediately that the next token is intended as a filename. (I'm
presuming here that the global has been expanded and its contents are
unproblematic.) Given that, it's immediate that -using- has been
omitted.

It's really hard for a program to spot that kind of thing. Naturally,
Stata's code could be bloated by adding all sorts of ad hoc extras.
Every time there is a common problem, you could put in a trap to catch
it and to explain it nicely. That's what user-programmers can try to
do for their own microcosms, but it's limitless.

Let's take another example, your original.

replace myvar = "something" if missing(myvar) | if regexm(myvar, "myregex")

It takes a human knowing Stata to see that the second "if" is wrong.
But it's not inevitably an error to repeat -if- in a command. I can go

scatter mpg weight if foreign == 0 || scatter mpg weight if foreign == 1

and that's fine. So, it would be hard work for Stata to spot that kind
of error.
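
To make that concrete, here is a small Python sketch (not Stata, and not
anything Stata itself does) of the obvious ad hoc rule, "count the -if-
qualifiers outside quoted strings". The function name
count_if_qualifiers is hypothetical, and the rule assumes that double
quotes are paired on the line:

```python
import re


def count_if_qualifiers(command):
    """Count occurrences of the word `if` outside double-quoted strings."""
    # Blank out quoted strings first so an "if" inside a string is ignored.
    unquoted = re.sub(r'"[^"]*"', '""', command)
    return len(re.findall(r"\bif\b", unquoted))


bad = 'replace myvar = "something" if missing(myvar) | if regexm(myvar, "myregex")'
good = "scatter mpg weight if foreign == 0 || scatter mpg weight if foreign == 1"

print(count_if_qualifiers(bad), count_if_qualifiers(good))
```

Both commands contain two -if-s, so a trap that rejects any command with
more than one -if- would reject the legal -scatter- call as well. Telling
the two apart means knowing that || separates graphs in -twoway- syntax,
which is command-specific knowledge of exactly the kind that is hard to
build into a general checker.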

My crystal ball is just as cloudy as anyone else's, and I do hope and
expect that StataCorp will enhance the do-file editor to make it
easier to spot problems, but I'd be surprised at very much more.

Nick
[email protected]

On 10 April 2013 03:48, Christopher Zbrozek <[email protected]> wrote:
> Thanks, Nick!
>
> I agree it's best to use Stata's syntax checker... and, cough, ahem,
> if anyone from Statacorp is listening, if users had the ability in a
> future release to quickly scan a dofile for syntax errors using the
> executable's syntax checker (without actually needing to run the code
> on a dataset), you'd have some happy users in Ann Arbor.
>
> Best,
> Christopher Zbrozek
> University of Michigan
>
> On Mon, Apr 8, 2013 at 3:55 PM, Nick Cox <[email protected]> wrote:
>> I don't think there is any such program. It would be foolish of me to
>> rule out the possibility of some script that catches common errors,
>> but my experience is that it is best to let Stata itself find your
>> bugs and miscodings as fast as possible.
>>
>> Nick
>> [email protected]
>>
>>
>> On 8 April 2013 20:28, Christopher Zbrozek <[email protected]> wrote:
>>> Hello world,
>>>
>>> Is anyone aware of user-written code that would review a do file to
>>> ensure commands are written using valid Stata syntax? Or,
>>> alternatively, is there a way to exploit the Stata executable's
>>> internal syntax checker for this purpose?
>>>
>>> The idea is that because Stata code is interpreted rather than
>>> compiled, clearly boneheaded syntax errors aren't caught until
>>> runtime. For example, when insufficiently caffeinated this morning, I
>>> tried to use a command along the lines of
>>>
>>> replace myvar = "something" if missing(myvar) | if regexm(myvar, "myregex")
>>>
>>> which, with that second "if" in there, works about as well as one
>>> thinks it should.
>>>
>>> One (quasi-)solution is of course to debug code using a small sample
>>> dataset before running it on millions of observations, which
>>> ameliorates but doesn't fix the problem. A rather laborious solution
>>> would be to write an ado file or Perl script or something to find
>>> common syntax errors and run that at the top of a do file on the text
>>> of the do file itself. An ideal solution would be to somehow employ
>>> the syntax checking Stata will perform anyway rather than trying to
>>> reverse-engineer that portion of the Stata executable.