


st: re:

From   Christopher Baum <[email protected]>
To   "[email protected]" <[email protected]>
Subject   st: re:
Date   Sun, 21 Aug 2011 14:18:45 -0400


Phil said

Believe it or not, I encounter exactly this type of thing fairly often (e.g., with genotype files in the 50-100GB range).  I typically use Python to do the type of exploratory diagnostics you're talking about.  For example, a program to read a large delimited file and report on the number of eols, number of items per line (i.e., between each eol), etc. can be written with no more than a dozen lines of code, and will rip through a large file pretty quickly (your rate-limiter here will be system IO anyway).  Once you know where the problem(s) are, it's then usually easy to splice in a fix for the problematic lines.
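[A minimal sketch of the kind of dozen-line Python scanner Phil describes, not his actual script. The function name, delimiter default, and report format are my own; the file is read in binary so a pass over a multi-GB file stays I/O-bound, as he notes.]

```python
from collections import Counter

def scan_delimited(path, delimiter="\t"):
    """Count lines and the distribution of fields per line in a delimited file."""
    field_counts = Counter()   # maps fields-per-line -> number of such lines
    lines = 0
    with open(path, "rb") as fh:          # binary mode: no decoding surprises
        for raw in fh:                    # iterates on newlines; rate-limited by I/O
            lines += 1
            nfields = raw.rstrip(b"\r\n").count(delimiter.encode()) + 1
            field_counts[nfields] += 1
    return lines, field_counts

if __name__ == "__main__":
    import sys
    n, dist = scan_delimited(sys.argv[1])
    print(f"{n} lines")
    for nfields, count in sorted(dist.items()):
        print(f"  {count} line(s) with {nfields} field(s)")
```

Any row count that appears only a handful of times in the report is usually where the problematic lines live, and those can then be patched individually.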

Some would say this should be done with tools like sed and awk, and indeed, if you're proficient with them, they're great.  However, realistically, you will spend a fair amount of time getting started, and if this is the type of thing you do only occasionally, you will likely have forgotten everything you learned by the next time you need to use it.

As you're on a *nix system, the wc command is pretty useful. It can give you a line, 'word', and character count very rapidly (reported in that order).  E.g.,

bcvpn34:~ cfbaum$ wc ifsitemunits.txt
   56974  179684  878191 ifsitemunits.txt

Kit Baum  |  Boston College Economics & DIW Berlin
An Introduction to Stata Programming  |  An Introduction to Modern Econometrics Using Stata
