
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Improved commands, sample implementations. Any interest?


From   Jeph Herrin <[email protected]>
To   [email protected]
Subject   Re: st: Improved commands, sample implementations. Any interest?
Date   Fri, 07 Dec 2012 14:04:32 -0500

On this topic (or close to it), I routinely modify SSC and StataCorp .ado files, but exclusively to include additional -return- statements. E.g., to the very handy -levelsof.ado- I added

  return local nvals `"`nvals'"'

so I could capture the number of items in the returned macro without counting it up myself.

On the one hand, it seems like a very "passive" modification to return a macro that the author has been kind enough to work out the contents of and -display- for me to see. On the other hand, I am still reluctant enough about fiddling with others' code that I nonetheless change the name (e.g., to -mylevelsof.ado-).

All of which is to say, I would encourage StataCorp and especially SSC authors to be liberal with -return-ing calculated values.

cheers,
Jeph


On 12/7/2012 12:16 PM, Nick Cox wrote:
Some etiquette for working with user-written programs has long since
been suggested at

http://www.stata.com/support/faqs/resources/statalist-faq/#relation

I quote the most relevant part

"In practice, you can probably take anything published in either
medium and modify it as you will—especially if you do that
privately—but publicly we recommend that, unless you are the original
author, you change the name of the program, take all blame for any
limitations your changes produce, and imply that a suitably large
portion of the credit for the program belongs to the original
authors."

From that and other considerations my own suggestion is that

1. Publication of user-written commands requires publication of help
files to be taken seriously.

2. Existing names, meaning those attached to existing commands in
Stata or made public through the Stata Journal or SSC or accessible
websites, are to be considered the property of StataCorp or the
program authors. So, you should use new names. Not doing so runs the
risk of confusing many, to say the least. (As above, no help files are
available to document your acknowledgments readably. Documenting what
a program does within the code is natural to programmers, but
manifestly the typical Stata user doesn't expect to have to read the
code.)

3. It would have been courteous to inform existing program authors
privately before publicly advertising "improved versions" of their
programs. In my case you should feel free to publish improved versions
of my programs under different names and with help files.

Nick

On Fri, Dec 7, 2012 at 4:34 PM, James Sams <[email protected]> wrote:
I keep changing some user-written commands to suit my purposes or fix things
that have broken over the years and thought I'd contribute these back.
However, some peer review may be a good idea before tracking down the
individual authors and trying to get the changes committed.

Here is a summary of what I have right now:

   * collapse_preserve_label.do: preserve variable and value labels of
     same-named variables when using collapse. I believe StataCorp has an FAQ
     that outlines this program.

   * gzfile.ado: provide ability to interact with gzipped dta files using
     modern syntax of Stata's various file commands (save, use, append, merge).
     Derived from gzsave.

   * indexesof.ado: a variant of levelsof to skirt around macro length issues
     and provide the index within the dataset of each unique value.

   * insheet2.ado: a more reliable insheet, uses replace_dquotes.py.

   * labmask.ado: an update to the original labmask to be faster.
     Depends on indexesof.

   * replace_dquotes.py: Replaces double quotes in csv files with another
     character, e.g., a pipe ('|'), so that Stata's insheet does not corrupt
     the input.  Assumes there are no |'s in the original data. After import,
     replace all |'s in all string variables with double quotes to restore the
     original data. The character used is printed to stdout.

   * unique.ado: edited unique command from ssc to accept a compound if stmt.

You can check out the files and future updates/additions at my bitbucket
repository: https://bitbucket.org/james.sams/statafiles/

There are no help files, but the commands are well documented within each
source file.


A couple examples of what I've changed:

An example of a performance improvement is labmask.ado, which is derived from
Nick Cox's labmask. On somewhat larger datasets (a couple of million
observations with thousands of unique value/label pairs), this version runs in
a few seconds rather than multiple hours. It also does not require the creation
of any new variables, just a couple of Mata vectors; so, it does not increase
memory usage much at all.

insheet breaks for me, and others I provide support for, constantly. Between
truncating data, misinterpreting column breaks, and not using double by
default, I think insheet should be used more conservatively than most may
expect given the apparent simplicity of the command, especially since a lot of
these errors are silent and are not easy to catch.

I wrote insheet2/replace_dquotes.py to be a catch-all place to put all the
necessary guards for insheet, to be used without a second thought. I'm not
100% sure that I've caught everything, but it has worked for me on all the
datasets that have failed with insheet. The one exception is one-observation
files that do not have a header, which Stata still interprets as having 0
observations unless the 'nonames' argument is given.
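
To make the quote-replacement idea concrete, here is a minimal Python sketch.
It is not the replace_dquotes.py from the repository; the function names
replace_dquotes and restore_dquotes, and the choice to parse the file with
Python's csv module, are assumptions made only for illustration. It swaps
double quotes embedded in fields for a pipe and refuses to run if the pipe
already occurs in the input.

```python
import csv
import io


def replace_dquotes(text, replacement="|"):
    """Return CSV text with double quotes inside fields swapped for `replacement`.

    Hypothetical helper: illustrates the idea only, not the original script.
    """
    # Guard the stated assumption: the replacement must not occur in the data.
    if replacement in text:
        raise ValueError(f"replacement {replacement!r} already occurs in the input")
    rows = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # After parsing, any remaining '"' is part of the field's content.
        writer.writerow([field.replace('"', replacement) for field in row])
    return out.getvalue()


def restore_dquotes(value, replacement="|"):
    """Undo the swap on a single string value (the restore step after import)."""
    return value.replace(replacement, '"')


sample = 'id,note\n1,"He said ""hi"""\n2,"a,b"\n'
cleaned = replace_dquotes(sample)
```

In this sketch a field that contains a comma is still written with surrounding
quotes; only quotes that are part of the data are replaced. A command-line
wrapper could print the replacement character to stdout, as the description
above says the original script does.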

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


