Christopher Baum <firstname.lastname@example.org>
Sun, 21 Aug 2011 14:18:45 -0400
Believe it or not, I encounter exactly this type of thing fairly often (e.g., with genotype files in the 50-100GB range). I typically use Python to do the type of exploratory diagnostics you're talking about. For example, a program to read a large delimited file and report on the number of eols, number of items per line (i.e., between each eol), etc. can be written with no more than a dozen lines of code, and will rip through a large file pretty quickly (your rate-limiter here will be system IO anyway). Once you know where the problem(s) are, it's then usually easy to splice in a fix for the problematic lines.
Some would say this should be done with tools like sed and awk, and indeed, if you're proficient with them, they're great. However, realistically, you will spend a fair amount of time getting started, and if this is the type of thing you do only occasionally, you will likely have forgotten everything you learned by the next time you need to use it.
As you're on a *nix system, the wc command is pretty useful. It very rapidly reports a line, 'word', and byte count (use wc -m if you want a character count in a multibyte locale). E.g.,
bcvpn34:~ cfbaum$ wc ifsitemunits.txt
56974 179684 878191 ifsitemunits.txt
Kit Baum | Boston College Economics & DIW Berlin | http://ideas.repec.org/e/pba1.html
An Introduction to Stata Programming | http://www.stata-press.com/books/isp.html
An Introduction to Modern Econometrics Using Stata | http://www.stata-press.com/books/imeus.html