
st: Improved commands, sample implementations. Any interest?

From	James Sams <[email protected]>
To	[email protected]
Subject	st: Improved commands, sample implementations. Any interest?
Date	Fri, 07 Dec 2012 10:34:50 -0600

I keep changing some user-written commands to suit my purposes or fix things 
that have broken over the years and thought I'd contribute these back. 
However, some peer review may be a good idea before tracking down the 
individual authors and trying to get the changes committed.

Here is a summary of what I have right now:

  * collapse_preserve_label.do: preserve variable and value labels of
    same-named variables when using collapse. I believe StataCorp has an FAQ
    that outlines this program.

  * gzfile.ado: provide ability to interact with gzipped dta files using
    modern syntax of Stata's various file commands (save, use, append, merge).
    Derived from gzsave.

  * indexesof.ado: a variant of levelsof to skirt around macro length issues
    and provide the index within the dataset of each unique value.

  * insheet2.ado: a more reliable insheet, uses replace_dquotes.py.

  * labmask.ado: an update to the original labmask to be faster.
    Depends on indexesof.

  * replace_dquotes.py: replaces double quotes in csv files with another
    character, e.g., a pipe ('|'), so that Stata's insheet does not corrupt
    the input. Assumes there are no |'s in the original data. After importing,
    replace all |'s in all string variables with double quotes to restore the
    original data. The character used is printed to stdout.

  * unique.ado: an edited version of the unique command from SSC that accepts
    a compound if condition.
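
To make the replace_dquotes.py idea above concrete, here is a minimal Python
sketch of the round trip. It is not the script from the repository: the
function names and the list of candidate characters are assumptions made for
illustration, and it simply swaps every double quote for the first candidate
character that does not already occur in the text.

```python
# Hypothetical candidate characters, tried in order. The real script may
# choose its replacement character differently.
CANDIDATES = "|~^`"


def replace_dquotes(text, candidates=CANDIDATES):
    """Swap every double quote for a character that is absent from text.

    Returns (new_text, char_used). Raises ValueError if every candidate
    character already occurs in the text.
    """
    for ch in candidates:
        if ch not in text:
            return text.replace('"', ch), ch
    raise ValueError("no unused replacement character available")


def restore_dquotes(value, ch):
    """Undo the substitution on a string value after it has been imported."""
    return value.replace(ch, '"')


# Small demonstration on an in-memory csv fragment.
sample = 'id,comment\n1,"said ""hi"", then left"\n'
cleaned, used = replace_dquotes(sample)
print(used)  # report the character used, as the post says the script does
```

Inside Stata, the restore step would be applied to every string variable
after the import; restore_dquotes only shows the same operation on a single
value.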

You can check out the files and future updates/additions at my Bitbucket 
repository: https://bitbucket.org/james.sams/statafiles/

There are no help files, but the commands are well documented within each 
source file.


A couple of examples of what I've changed:

An example of a performance improvement is labmask.ado, which is derived from 
Nick Cox's labmask. On somewhat larger datasets (a couple of million 
observations with thousands of unique value/label pairs), this version runs in 
a few seconds rather than multiple hours. It also does not require creating 
any new variables, just a couple of Mata vectors, so it barely increases 
memory usage.
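
The general idea of pairing each unique value with an observation index and
reading its label from there can be sketched in Python as follows. This is
only an illustration under stated assumptions, not a description of how
labmask.ado or indexesof.ado are actually written: the function names are
hypothetical, and it assumes that the first occurrence of each value is the
index of interest.

```python
def first_index_of_each(values):
    """Map each unique value to the 1-based index of its first occurrence.

    A single pass over the data; 1-based to mirror Stata observation numbers.
    """
    first = {}
    for i, v in enumerate(values, start=1):
        first.setdefault(v, i)  # keeps the earliest index only
    return first


def value_labels(codes, labels):
    """Build a code -> label mapping by looking up one observation per code.

    Assumes that codes and labels are parallel sequences and that every
    observation with a given code carries the same label.
    """
    return {code: labels[i - 1] for code, i in first_index_of_each(codes).items()}


codes = [2, 1, 2, 3, 1]
labels = ["b", "a", "b", "c", "a"]
mapping = value_labels(codes, labels)
```

With this approach the work is one scan of the data plus one lookup per
unique value, which is in keeping with the post's point that no new variables
are needed.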

insheet breaks constantly for me and for others I provide support for. Between 
truncating data, misinterpreting column breaks, and not storing numbers as 
double by default, I think insheet should be used more cautiously than its 
apparent simplicity suggests, especially since many of these errors are 
silent and hard to catch.

I wrote insheet2/replace_dquotes.py as a single place to collect all the 
necessary guards for insheet, so that it can be used without a second thought. 
I'm not 100% sure that I've caught everything, but it has worked for me on all 
the datasets that failed with insheet. The one exception is single-observation 
files without a header, which Stata still reads as having 0 observations 
unless the 'nonames' option is given.

-- 
James Sams
[email protected]
