Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Barbara Guimarães <barbara.vgh@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Problem with infix: record too long |
Date | Mon, 25 Apr 2011 20:49:53 -0300 |
Nick, thanks for your response.

Using -type filename.txt- as you suggested, Stata showed me the following first lines:

. type TS_QUEST_ALUNO.txt
1373262421RN24GROSSOS 2404408ADCDBAAACABDCABCEAAAAAAAAAAAAAAC*CBAABAAAAAA
1373263421RN24GROSSOS 2404408BDKEAAAABABDCADBDACAAAB.AAAAAAAABBBBACAAAAAA
1373264421RN24GROSSOS 2404408BAAACBAAB..DCADCDAAAAAABBAAAAAAAAABCAAAA.A..

and the output then ended as:

> ......................................................................................................................................................................................................
> ............................................................................................................................................................c4 ......?.:Z3. .R...x.9..........T.Np(0$%'...@#../q..'!m.t.F2$*J

To me this looks like a fixed format, but I might be wrong.

Regards,
Barbara

2011/4/24 Nick Cox <njcoxstata@gmail.com>:
> Your last question is, in effect, can I explain to you how to read a
> binary file with unspecified structure into Stata, and the short
> answer is sorry, no.
>
> It's a rare word processor that can open large binary files with
> success. Word processors accept a range of formats for documents,
> tending to prefer their own proprietary format, but are usually
> useless at reading binary data files. A good text editor could do it;
> that does not include the proprietary editors bundled with MS Windows.
>
> I wonder if you are being misled by the first line in the help for
> -infix- below, while overlooking the second line, which is vital.
>
> "infix reads into memory from a disk dataset that is not in Stata
> format. infix requires
> that the data be in fixed-column format."
>
> As you reported, Stata is seeing far fewer end-of-line character pairs
> \r\n than lines in this file, \r and \n characters are occurring by
> themselves, which is not standard for text files in MS Windows, and
> -hexdump- is labelling this binary.
> It's unlikely to be wrong on that.
>
> You could try just
>
> . type filename.txt
>
> in Stata and that might show you, and us, the first few lines of the
> file. They might be recognisable to someone as in a particular format.
>
> I think if you can't get an idea of what the structure of this file
> is, then you have no way to read it into Stata. Why a "government
> organisation" is providing a binary file and calling it .txt I cannot
> explain. You may need to talk to them.
>
> Nick
>
> 2011/4/24 Barbara Guimarães <barbara.vgh@gmail.com>:
>> Dear Nick, unfortunately, I'm not able to open the file with any
>> word processor (I believe that is because of its size; this
>> dataset was provided by a government organization, so I
>> received it in .txt format and don't have access to the primary data).
>>
>> However, the output of hexdump with the analyze option was:
>>
>> . hexdump TS_QUEST_ALUNO.txt, analyze
>>
>>   Line-end characters                        Line length (tab=1)
>>     \r\n         (Windows)      2,517,361      minimum                    0
>>     \r by itself (Mac)            686,626      maximum           20,971,542
>>     \n by itself (Unix)           768,441
>>   Space/separator characters                 Number of lines      3,972,429
>>     [blank]                   112,067,613    EOL at EOF?                 no
>>     [tab]                         707,187
>>     [comma] (,)                   765,547    Length of first 5 lines
>>   Control characters                           Line 1                   120
>>     binary 0                   30,611,037      Line 2                   120
>>     CTL excl. \r, \n, \t       19,330,367      Line 3                   120
>>     DEL                           367,820      Line 4                   120
>>     Extended (128-159,255)     21,370,596      Line 5                   120
>>   ASCII printable
>>     A-Z                       149,642,323
>>     a-z                        16,234,081    File format             BINARY
>>     0-9                        53,967,247
>>     Special (!@#$ etc.)        28,963,365
>>     Extended (160-254)         54,882,559
>>                            ---------------
>>   Total                       495,399,531
>>
>>   Observed were:
>>      \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
>>      ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5
>>      6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
>>      Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | }
>>      ~ DEL 128 E^A E^B E^C E^D E^E E^F E^G E^H E^I E^J E^K E^L E^M E^N E^O
>>      E^P E^Q E^R E^S E^T E^U E^V E^W E^X E^Y E^Z 155 156 157 158 159 160 ¡ ¢
>>      £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ
>>      Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê
>>      ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ 255
>>
>> Is there any way I could transform this dataset so that Stata would
>> read it entirely?
>>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
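
The -type- output in this thread suggests that the file starts as clean fixed-width records and only later turns into binary debris. One way to check where that happens, outside Stata, is to scan the file record by record. What follows is only a minimal sketch in Python, not a Stata feature: the function name first_bad_record is made up for illustration, and the default record length of 120 is an assumption taken from the "Length of first 5 lines" that -hexdump- reported (it may or may not count the end-of-line characters, so it may need adjusting).

```python
def first_bad_record(path, record_len=120):
    """Return (line_number, byte_offset) of the first record that is not
    exactly record_len bytes of printable text, or None if all are clean.

    Assumptions (not confirmed for the real file): records end in \n or
    \r\n, and a clean record holds only bytes 32-126 or 160-255.
    """
    offset = 0  # byte position where the current record starts
    with open(path, "rb") as fh:          # binary mode: no decoding surprises
        for lineno, raw in enumerate(fh, start=1):   # splits on \n only
            body = raw.rstrip(b"\r\n")    # drop the end-of-line characters
            clean = len(body) == record_len and all(
                32 <= b < 127 or b >= 160 for b in body
            )
            if not clean:
                return lineno, offset
            offset += len(raw)
    return None
```

Calling first_bad_record("TS_QUEST_ALUNO.txt") would then report the first line, and its byte offset, at which the file stops looking like fixed-column text. If everything before that point is clean, those records could be copied to a new file and tried with -infix-; whether the remainder is recoverable is still a question for whoever supplied the file.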