
Re: st: mac editor for big files


From   Eric Booth <[email protected]>
To   "<[email protected]>" <[email protected]>
Subject   Re: st: mac editor for big files
Date   Mon, 22 Aug 2011 03:22:59 +0000

On Aug 21, 2011, at 10:02 AM, Richard Goldstein wrote:
> the reason for wanting an editor for these files is that Stata will not
> correctly import them using -insheet- (or anything else I tried) due to
> "problems" in the data (e.g., one file had 2 successive end-of-line
> characters in the middle of what was supposed to be a line of data); so
> I want to look at the small number of files with apparent problems to
> see if I can fix them prior to importing; since I don't know what the
> problem is or even where it occurs, I have been unable to figure out how
> to use -filefilter-

What commands did you try, and what issues did you have with -filefilter- and -insheet-? Did you try importing the text file with -intext- (from SSC)? I've had luck importing these types of files into Stata with -intext-, even problematic ones.

> 
> I downloaded macvim […] macvim appears to work well, though I still
> haven't figured out how to get it to show what are ordinarily "hidden"
> characters (e.g., tab, eol);

Take a look at these suggestions for displaying invisible characters in macvim, which address the issue you describe:  http://superuser.com/questions/249289/display-invisible-characters-in-vim
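For example, these standard Vim settings (nothing macvim-specific, so they should behave the same there) toggle the display of tabs and end-of-line characters:

:set list                     " show invisibles; eol renders as $
:set listchars=tab:>-,eol:$   " customize how tabs and line-ends are drawn
:set nolist                   " turn the markers back off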

> I need to see these to "fix" at least two
> files; however, just opening the files solved one problem for me: it
> told me how many lines of data there were supposed to be so I can check
> the results of importing


First: as Nick mentioned, take a look at -hexdump-, or, as Kit mentioned, use the OS command 'wc' to count lines.
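For instance (with "testfile.txt" standing in for one of your files), either of these reports line-end counts or line totals without opening the file:

hexdump testfile.txt, analyze   // tabulates the file's bytes, including \r, \n, and tab counts
!wc -l testfile.txt             // *nix line count to check the import against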

I'll reiterate my suggestion to break the file up into manageable chunks using a program like -chunky- (from SSC). I think this is the key to making the file small enough to open in a text editor you are comfortable with (e.g., TW) so you can manually search for problems. Once you find the changes you need to make by inspecting these chunks in TW, you can use any solution you like (-filefilter-, OS tools like awk, perl, etc., or just find/replace in TW) to fix the file so that it can be imported into Stata. Finally, you can make these changes to the original text file, or make them to the 'chunks' and concatenate the pieces back together.


Here's an example of working with a >1GB file by breaking it up into chunks and analyzing it using -chunky-, -hexdump-, or some OS tools.  -hexdump- shows the eol and eof characters that you may need to find/replace with the aforementioned methods.  If you are still having issues with removing the characters of interest, let us know what -hexdump- reports and the char patterns you are attempting to change/replace.  For the example below: (1) you need -appendfile-, -intext-, and -chunky- (from SSC) and (2) this example takes a while (>30 min) to run, depending on your machine, since we're analyzing a large text file.

****************!  example below is for MacOSX:
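//0.one-time setup: install the user-written SSC packages used below
//  (-cap- just swallows the error if a package is already installed)
cap ssc install chunky
cap ssc install appendfile
cap ssc install intext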
//1.create fake c.1 GB text file (testfile.txt)
clear
set obs 89410000
g x = runiform()*10
g y = runiform()*10
outsheet using "testfile.txt", replace nonames noquote //go get some coffee...
**for a faster, non-Stata-based approach, create it using the *nix tool 'dd':
**!dd if=/dev/urandom of=testfile.txt bs=1073741824 count=1 conv=ascii,lcase


//2.take a look at testfile.txt without opening it in a text editor
chunky using testfile.txt, analyze
    **or**
hexdump testfile.txt, analyze results
 **browse some unexpected chars in file with *nix tool 'tr':
!tr -d '[a-zA-Z0-9!#@_?+ \t\n\\()"^~`%-]'\'{} < testfile.txt  | hexdump -c
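 **(tr -d deletes the listed "expected" characters; whatever survives is piped to the OS hexdump -c, so any unexpected bytes are displayed)**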
 **examine head/tail of file instead of opening it:
chunky using testfile.txt, peek(10)
	**or**
!head -n 10 testfile.txt 
	**or**
!perl -n -e 'print if (1 ... 10)' testfile.txt | less


//3.break up testfile.txt into 300m chunks that can be opened by TW
chunky using testfile.txt, chunksize(300m) stub(part) replace
  **now, find problems in TW manually**


//4.concat back together with -appendfile- or OS-tool 'cat'
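  **(each later chunk is appended onto the end of part0001.txt, which ends up holding the full file)**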
forval n = 2/6 {
  appendfile part000`n'.txt  part0001.txt
} 

  **alternatively**
loc concat:dir "`c(pwd)'" files "part*", respectcase nofail
loc concat:subinstr loc concat `"""' "", all
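  **(the -subinstr- line strips the double quotes that the dir extended macro function wraps around each filename, so the shell sees a plain list)**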
di `"`concat'"'
cap rm testfile_ALLPARTS.txt
!cat `concat'  >> testfile_ALLPARTS.txt



//5.use -filefilter-, etc., to make changes (e.g., fix two EOLs in a row)
****examples (watch how the line-end chars change):
hexdump testfile.txt, analyze results
filefilter testfile.txt testfile_FIX1.txt , from(\n\n) to(\n)
filefilter testfile_FIX1.txt testfile_FIX2.txt , from(\n) to(\r)
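  **(first pass collapses doubled \n line-ends; second converts the remaining Unix \n to classic Mac \r)**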
hexdump testfile_FIX2.txt, analyze results
!perl -i -p -e 's/\n//' testfile_FIX2.txt
hexdump testfile_FIX2.txt, analyze results


// after fixing the file, you can also use -intext-:
clear
intext using testfile.txt, gen(rec) len(240)
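  **(gen(rec) is the stub for the string variable(s) -intext- creates; len(240) limits the characters read into each; see -help intext-)**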

*****************!
^watch for line wrapping in the snippet above


- Eric



