From | "David Radwin" <dradwin@mprinc.com> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | st: RE: Getting rid of binary codes so I can read in files - reposted |
Date | Wed, 18 Jan 2012 08:32:32 -0800 (PST) |
Orian,

I've never used it myself, but you might try Google Refine:

http://www.stata.com/statalist/archive/2010-11/msg00858.html
http://code.google.com/p/google-refine/

Please let us know if it works for you or not.

David
--
David Radwin
Research Associate
MPR Associates, Inc.
2150 Shattuck Ave., Suite 800
Berkeley, CA 94704
Phone: 510-849-4942
Fax: 510-849-0794
www.mprinc.com

> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Orian Brook
> Sent: Wednesday, January 18, 2012 6:40 AM
> To: statalist@hsphsun2.harvard.edu
> Subject: st: Getting rid of binary codes so I can read in files - reposted
>
> Not lucky enough to have had any replies so far - is there anyone with any
> suggestions, or shall I just revert to Outlook?
> Thanks
> Orian
>
> Dear all
> I'm analysing administrative data which I've had to export using an online
> database into 105 files. I've previously worked with similar files by
> importing and combining them all in Outlook, then reading into Stata using
> an odbc link, but I'd really like to try to do it all in Stata (so I have
> the do file for repetition/audit trail purposes) but I have some problems.
> The original files have extra EOL characters, and extended ones, which I can
> get rid of using filefilter, but I still can't import the file: using
> insheet I get the correct number of rows and columns, but all cells are
> blank except the first (it has a t in it). I've also tried using infile and
> skipping the first line, to no avail. Running hexdump shows that I have over
> 2 million binary 0s, which I think may be the problem? I tried using the
> command "filefilter file1 file2, from(\00hd) to() replace" to get rid of
> them, but it hangs.
>
> Any help would be very gratefully received. The hexdump is below.
> (apologies, plain text format doesn't allow me to post this in courier or
> something more legible)
>
> Regards
> Orian Brook
>
>   Line-end characters                        Line length (tab=1)
>     \r\n         (Windows)          26,823     minimum                    2
>     \r by itself (Mac)                   0     maximum                  403
>     \n by itself (Unix)                  0
>   Space/separator characters                 Number of lines         26,824
>     [blank]                        107,191     EOL at EOF?               no
>     [tab]                                0
>     [comma] (,)                    509,637   Length of first 5 lines
>   Control characters                           Line 1                   403
>     binary 0                     2,747,580     Line 2                   185
>     CTL excl. \r, \n, \t                 0     Line 3                   243
>     DEL                                  0     Line 4                   245
>     Extended (128-159,255)               0     Line 5                   245
>   ASCII printable
>     A-Z                            189,766
>     a-z                            189,754   File format             BINARY
>     0-9                          1,509,729
>     Special (!@#$ etc.)            187,857
>     Extended (160-254)                   0
>                            ---------------
>   Total                          5,495,160
>
>   Observed were:
>     \0 \n \r blank , - . / 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O
>     P Q R S T U V W X Y Z _ a b c d e f g h i k l m n o p q r s t u v x y

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
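
A note on the hexdump figures: the binary 0 count (2,747,580) is exactly half of the total (5,495,160). That pattern is what one would expect if the export is UTF-16 text, where each ASCII character is followed by a zero byte, though the post does not confirm the encoding. The Python sketch below is one possible way to clean such a file outside Stata, under that assumption. The function names and the file names in the usage note are hypothetical, and the UTF-16 check is only a heuristic.

```python
from pathlib import Path


def looks_like_utf16le(raw: bytes, sample: int = 4096) -> bool:
    """Heuristic guess: ASCII text stored as UTF-16LE has a zero in every odd byte."""
    if len(raw) == 0 or len(raw) % 2 != 0:
        return False  # UTF-16 data always has an even number of bytes
    if raw.startswith(b"\xff\xfe"):
        return True  # little-endian byte-order mark
    odd_bytes = raw[:sample][1::2]
    return odd_bytes.count(0) == len(odd_bytes)


def to_plain_text(raw: bytes) -> bytes:
    """Return the content without zero bytes, decoding UTF-16LE when it looks like it."""
    if looks_like_utf16le(raw):
        text = raw.decode("utf-16-le").lstrip("\ufeff")  # drop a leading BOM if present
        return text.encode("utf-8")
    # Fallback: treat the zeros as stray padding and simply delete them.
    return raw.replace(b"\x00", b"")


def clean_file(src: str, dst: str) -> None:
    """Read src as bytes, clean it, and write the result to dst."""
    Path(dst).write_bytes(to_plain_text(Path(src).read_bytes()))
```

Calling clean_file("file1.csv", "file1_clean.csv") on each of the exported files (names are placeholders) should leave comma-separated text that insheet may be able to read. It is worth rerunning hexdump on the output to confirm that the binary 0 count has dropped to zero.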