Re: st: mac editor for big files
From: Richard Goldstein <[email protected]>
To: [email protected]
Subject: Re: st: mac editor for big files
Date: Mon, 22 Aug 2011 09:19:47 -0400
Problem apparently solved as follows (a sketch of steps 2-3 appears after the list):
1. used macvim (thanks Phil and Eric) to see how large the files were as
a check on the import
2. used -hexdump- (thanks Nick) to find the apparent problem (the files
included 3 different end-of-line characters)
3. used -filefilter- (thanks Eric) to clean this problem up
4. used StatTransfer to import the files (even after the above,
-insheet- gave me too many variables and too few lines; probably an
error on my part, but using StatTransfer was easier than trying to
figure the error out)
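
For anyone facing the same issue, here is a minimal sketch of steps 2-3, assuming the stray terminators were Windows (\r\n) and classic-Mac (\r) line ends mixed into a Unix (\n) file; the filenames are placeholders:

hexdump rawfile.txt, analyze results                 // tabulate \r, \n, and \r\n counts
filefilter rawfile.txt step1.txt, from(\W) to(\U)    // Windows \r\n -> Unix \n
filefilter step1.txt clean.txt, from(\M) to(\U)      // lone Mac \r -> Unix \n
!wc -l clean.txt                                     // compare to the expected line count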
Rich
On 8/21/11 11:22 PM, Eric Booth wrote:
> <>
> On Aug 21, 2011, at 10:02 AM, Richard Goldstein wrote:
>> the reason for wanting an editor for these files is that Stata will not
>> correctly import them using -insheet- (or anything else I tried) due to
>> "problems" in the data (e.g., one file had 2 successive end-of-line
>> characters in the middle of what was supposed to be a line of data); so
>> I want to look at the small number of files with apparent problems to
>> see if I can fix them prior to importing; since I don't know what the
>> problem is or even where it occurs, I have been unable to figure out how
>> to use -filefilter-
>
> What commands did you try? -- what were the issues you had with -filefilter- and -insheet-? Did you try importing the text file with -intext- (from SSC)? I've had luck with importing these types of files into Stata with -intext-, even with problematic files.
>
>>
>> I downloaded macvim […] macvim appears to work well, though I still
>> haven't figured out how to get it to show what are ordinarily "hidden"
>> characters (e.g., tab, eol);
>
> Take a look at these suggestions for macvim for the issue you describe: http://superuser.com/questions/249289/display-invisible-characters-in-vim
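>
> In short, those suggestions boil down to vim's 'list' mode; roughly (one common setup, not the only one):
>
> :set list                        " render invisible characters
> :set listchars=tab:>-,eol:$      " show tabs as >- and line ends as $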
>
>> I need to see these to "fix" at least two
>> files; however, just opening the files solved one problem for me: it
>> told me how many lines of data there were supposed to be so I can check
>> the results of importing
>
>
> First: As Nick mentioned, take a look at -hexdump- or as Kit mentioned use the OS command 'wc'.
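>
> For example, something like (the filename here is a placeholder):
>
> hexdump mydata.txt, analyze results   // byte and end-of-line frequency table
> !wc -l mydata.txt                     // OS line count to check against _N after import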
>
> I'll reiterate my suggestion to break the file up into manageable chunks using a program like -chunky- (from SSC). I think this is the key to making the file small enough to open in a text editor you are comfortable with (e.g., TW) to manually search for problems. Once you find the changes you need to make by inspecting these chunks in TW, you can use any solution you want (-filefilter-, OS tools like awk/perl/etc., or just find/replace in TW) to make changes to the file so that it can be imported into Stata. Finally, you can make these changes to the original text file, or make them to the 'chunks' of text files and concatenate those back together.
>
>
> Here's an example of working with a >1GB file by breaking it up into chunks and analyzing it using -chunky-, -hexdump-, or some OS tools. -hexdump- shows the eol and eof characters that you may need to find/replace with the aforementioned methods. If you are still having issues with removing the characters of interest, let us know what -hexdump- reports and the char patterns you are attempting to change/replace. For the example below: (1) you need -appendfile-, -intext-, and -chunky- (from SSC) and (2) this example takes a while (>30 min) to run, depending on your machine, since we're analyzing a large text file.
>
> ****************! example below is for MacOSX:
> //1.create fake c.1 GB text file (testfile.txt)
> clear
> set obs 89410000
> g x = runiform()*10
> g y = runiform()*10
> outsheet using "testfile.txt", replace nonames noquote //go get some coffee...
> **for a faster, non-stata based approach, create using *nix tool 'dd':
> **!dd if=/dev/urandom of=testfile.txt bs=1073741824 count=1 conv=ascii,lcase
>
>
> //2.take a look at testfile.txt w/o text editor
> chunky using testfile.txt, analyze
> **or**
> hexdump testfile.txt, analyze results
> **browse some unexpected chars in file with *nix tool 'tr':
> !tr -d '[a-zA-Z0-9!#@_?+ \t\n\\()"^~`%-]'\'{} < testfile.txt | hexdump -c
> **examine head/tail of file instead of opening it:
> chunky using testfile.txt, peek(10)
> **or**
> !head -n 10 testfile.txt
> **or**
> !perl -n -e 'print if (1 ... 10)' testfile.txt | less
>
>
> //3.break up testfile.txt into 300m chunks that can be opened by TW
> chunky using testfile.txt, chunksize(300m) stub(part) replace
> **now, find problems in TW manually**
>
>
> //4.concat back together with -appendfile- or OS-tool 'cat'
> forval n = 2/6 {
>     appendfile part000`n'.txt part0001.txt   // append each chunk onto part0001.txt
> }
>
> **alternatively**
> loc concat:dir "`c(pwd)'" files "part*", respectcase nofail
> loc concat:subinstr loc concat `"""' "", all
> di `"`concat'"'
> cap rm testfile_ALLPARTS.txt
> !cat `concat' >> testfile_ALLPARTS.txt
>
>
>
> //5.filefilter, etc to make changes (fix two EOL in row)
> ****examples (watch how Line-end chars change):
> hexdump testfile.txt, analyze results
> filefilter testfile.txt testfile_FIX1.txt , from(\n\n) to(\n)
> filefilter testfile_FIX1.txt testfile_FIX2.txt , from(\n) to(\r)
> hexdump testfile_FIX2.txt, analyze results
> !perl -i -p -e 's/\n//' testfile_FIX2.txt
> hexdump testfile_FIX2.txt, analyze results
>
>
> // after fixing the file, you can also use -intext- :
> clear
> intext using testfile.txt, gen(rec) len(240)
>
> *****************!
> ^watch for line wrapping in the snippet above
>
>
> - Eric