



Re: st: mac editor for big files


From   Phil Schumm <pschumm@uchicago.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: mac editor for big files
Date   Sun, 21 Aug 2011 10:25:52 -0500

On Aug 21, 2011, at 10:02 AM, Richard Goldstein wrote:
> the reason for wanting an editor for these files is that Stata will not correctly import them using -insheet- (or anything else I tried) due to "problems" in the data (e.g., one file had 2 successive end-of-line characters in the middle of what was supposed to be a line of data); so I want to look at the small number of files with apparent problems to see if I can fix them prior to importing; since I don't know what the problem is or even where it occurs, I have been unable to figure out how to use -filefilter-


Rich,

Believe it or not, I encounter exactly this type of thing fairly often (e.g., with genotype files in the 50-100GB range).  I typically use Python to do the type of exploratory diagnostics you're talking about.  For example, a program to read a large delimited file and report on the number of eols, number of items per line (i.e., between each eol), etc. can be written with no more than a dozen lines of code, and will rip through a large file pretty quickly (your rate-limiter here will be system IO anyway).  Once you know where the problem(s) are, it's then usually easy to splice in a fix for the problematic lines.
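As a rough illustration of the kind of diagnostic program described above, here is a minimal sketch. It assumes a comma-delimited file whose lines end in "\n" or "\r\n", and it counts delimiters naively (a delimiter inside a quoted field is counted like any other). The names scan_delimited, report, and max_examples are made up for this example; they are not from any library.

```python
from collections import Counter


def scan_delimited(path, delimiter=b",", max_examples=20):
    """Count lines and fields per line in a large delimited file.

    Returns (n_lines, field_counts, first_seen), where field_counts maps
    a number of fields to how many lines have that many fields, and
    first_seen maps it to the first few line numbers where it occurred.
    """
    field_counts = Counter()
    first_seen = {}
    n_lines = 0
    # Binary mode: lines are split on b"\n" only, nothing is decoded,
    # and the file is streamed one line at a time rather than loaded whole.
    with open(path, "rb") as f:
        for n_lines, line in enumerate(f, start=1):
            body = line.rstrip(b"\r\n")
            # An empty line (e.g. two successive EOLs) counts as 0 fields.
            n_fields = body.count(delimiter) + 1 if body else 0
            field_counts[n_fields] += 1
            examples = first_seen.setdefault(n_fields, [])
            if len(examples) < max_examples:
                examples.append(n_lines)
    return n_lines, field_counts, first_seen


def report(n_lines, field_counts, first_seen):
    """Print a summary, most common field count first."""
    print(n_lines, "lines")
    for n_fields, n in field_counts.most_common():
        print(n_fields, "fields:", n, "lines; first at", first_seen[n_fields])
```

Calling report(*scan_delimited("problem_file.csv")) on a hypothetical file prints one row per distinct field count. The most common count is normally the intended number of columns, so the line numbers listed against any other count are the places to inspect.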

Some would say this should be done with tools like sed and awk, and indeed, if you're proficient with them, they're great.  However, realistically, you will spend a fair amount of time getting started, and if this is the type of thing you do only occasionally, you will likely have forgotten everything you learned by the next time you need to use them.


-- Phil



