st: New version of file-chunking utility -chunky- is available from SSC


From   David Elliott <dcelliott@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: New version of file-chunking utility -chunky- is available from SSC
Date   Wed, 1 Sep 2010 16:27:28 -0300

Thanks to Kit Baum, a new version of my file-chunking utility -chunky-
is available from SSC. The previous version is still available but has
been deprecated as -chunky8-.

-chunky- has a completely new syntax, so if you have used it
previously you will have to rewrite your routine.  However, it now
achieves in a single command what previously required a loop, because
the looping logic is built into the routine.  The new logic and Mata
subroutines make chunking larger files faster, by up to several orders
of magnitude depending on hardware and network configuration.
-chunky- can automatically name the chunk files with a user-specified
stub, and provision is made for handling the header line present in
many text output formats.  New pre-chunking file analysis options are
available to examine file structure and help anticipate any infiling
problems.
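
For readers who want to see the general idea outside Stata, here is a
minimal Python sketch of chunking with a filename stub and a header
line.  It is not -chunky-'s code or syntax.  The function name
chunk_text_file, its parameters, the four-digit ".txt" naming scheme,
and the choice to repeat the header at the top of every chunk are all
assumptions made for illustration.

```python
import os
import tempfile


def chunk_text_file(path, stub, chunk_bytes, has_header=True):
    """Split `path` into files named <stub>0001.txt, <stub>0002.txt, ...

    Hypothetical helper, not part of -chunky-. Each chunk ends on a line
    boundary and holds roughly `chunk_bytes` bytes of data lines. When
    `has_header` is true, the first input line is repeated at the top of
    every chunk (an assumed reading of "handling the header line").
    """
    written = []
    with open(path, "rb") as src:
        header = src.readline() if has_header else b""
        out = None
        size = 0
        for line in src:
            # Start a new chunk on the first data line, or once the
            # current chunk has reached the requested size.
            if out is None or size >= chunk_bytes:
                if out is not None:
                    out.close()
                name = f"{stub}{len(written) + 1:04d}.txt"
                out = open(name, "wb")
                out.write(header)
                written.append(name)
                size = 0
            out.write(line)
            size += len(line)
        if out is not None:
            out.close()
    return written


if __name__ == "__main__":
    # Small demonstration on a throwaway file.
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "dump.csv")
        with open(source, "wb") as f:
            f.write(b"id,value\n")
            for i in range(10):
                f.write(b"%d,%d\n" % (i, i * i))
        for name in chunk_text_file(source, os.path.join(tmp, "part"), 12):
            print(os.path.basename(name), os.path.getsize(name), "bytes")
```

Because every chunk carries the header in this sketch, each one can be
read as a standalone delimited file and the results appended afterwards.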

Known issues:
The routine fails on very wide input lines (over the 32,768-character
limit of Mata's fget()).
There are errors in the example code in the notes section of the help
file.  (These will be corrected in the next point release.)
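
Until the wide-line limitation is addressed, one way to check a file
before chunking is to scan it for lines longer than the limit.  The
Python sketch below is an illustration only and is not part of
-chunky-.  The function name find_long_lines is hypothetical, and the
sketch counts bytes rather than characters and treats only lines
strictly longer than 32,768 as a problem, both of which are
assumptions.

```python
import os
import tempfile

# Limit quoted in the announcement for Mata's fget().
MATA_FGET_LIMIT = 32768


def find_long_lines(path, limit=MATA_FGET_LIMIT):
    """Return (line_number, length) pairs for lines longer than `limit`.

    Hypothetical helper. Length is measured in bytes, excluding the
    trailing newline or carriage return.
    """
    too_long = []
    with open(path, "rb") as src:
        for number, line in enumerate(src, start=1):
            length = len(line.rstrip(b"\r\n"))
            if length > limit:
                too_long.append((number, length))
    return too_long


if __name__ == "__main__":
    # Demonstration: a three-line file whose second line is too wide.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "wide.txt")
        with open(sample, "wb") as f:
            f.write(b"short line\n")
            f.write(b"x" * (MATA_FGET_LIMIT + 1) + b"\n")
            f.write(b"another short line\n")
        print(find_long_lines(sample))
```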

User feedback is appreciated, especially from Mac users since I do not
have access to a MacStata user.

Thanks to Amresh Hanchate for presenting the initial challenge to
redevelop -chunky- based on problems he was having and to Dan
Blanchette for his testing and diligent error-finding.

DC Elliott

TITLE
      'CHUNKY': module to chunk a large text file into smaller parts

DESCRIPTION
      chunky breaks a large text file into chunks of a size specified
      by the user. It is typically used to break a huge data dump that
      is too large for infiling into smaller manageable chunks. chunky
      will allow creation of serially named chunks for subsequent
      infiling or insheeting. The smaller data subsets can then be
      appended together to create a dataset with all required
      observations. This version of chunky has been completely
      rewritten to use the Mata capabilities of Stata release 9 and
      higher and the syntax has completely changed.  The previous
      version has been deprecated as  chunky8.  Some users may still
      require a line-indexed method of chunking files so chunky8 will
      continue to be supported.

TITLE
      'CHUNKY8': module to chunk a large text file into smaller parts
      (version 8)

DESCRIPTION
      chunky8 breaks a large text file into chunks of a user-specified
      size. It is typically used to break a huge data dump that is too
      large for infiling into smaller manageable chunks.
      chunky8 will allow serial chunking and then infiling or
      insheeting. The smaller data subsets can then be appended
      together to create a dataset with all required observations.
      chunky8 is the deprecated previous version of chunky, the latter
      having been completely rewritten to use the Mata capabilities of
      Stata release 9 and higher.  Some users may still require a
      line-indexed method of chunking files so chunky8 will continue to
      be supported.