RE: st: Problem with infix: record too long


From   DE SOUZA Eric <eric.de_souza@coleurope.eu>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Problem with infix: record too long
Date   Tue, 26 Apr 2011 09:54:40 +0200

The end of the file indicates that it is not a text file but a binary file. When you open a binary file in a text editor, that is the kind of output you get.
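
One way to see where the readable records stop is to scan the file for its first control byte. What follows is a minimal sketch in Python, not something taken from the thread: the function name first_control_byte is hypothetical, and treating only control characters other than tab, LF and CR (plus DEL) as "binary" is an assumption made for illustration.

```python
import re

# Control characters other than tab (9), LF (10) and CR (13), plus DEL (127).
# Assumption: these bytes should not occur in a genuine text file.
_CONTROL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def first_control_byte(path, chunk_size=1 << 20):
    """Return the offset of the first control byte in the file, or None."""
    offset = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return None  # reached end of file without a control byte
            match = _CONTROL.search(chunk)
            if match:
                return offset + match.start()
            offset += len(chunk)
```

Called as first_control_byte("TS_QUEST_ALUNO.txt"), it would return the byte offset at which the first such character occurs, or None if there is none. Everything before that offset is at least a candidate for closer inspection as ordinary text.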


Eric de Souza
College of Europe
Brugge (Bruges), Belgium
http://www.coleurope.eu


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Barbara Guimarães
Sent: 26 April 2011 01:50
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Problem with infix: record too long

Nick, thanks for your response.

Using -type filename.txt- as you suggested, Stata showed me the following first lines:

. type TS_QUEST_ALUNO.txt
 1373262421RN24GROSSOS
2404408ADCDBAAACABDCABCEAAAAAAAAAAAAAAC*CBAABAAAAAA
 1373263421RN24GROSSOS
2404408BDKEAAAABABDCADBDACAAAB.AAAAAAAABBBBACAAAAAA
 1373264421RN24GROSSOS
2404408BAAACBAAB..DCADCDAAAAAABBAAAAAAAAABCAAAA.A..

and then ended as:

> ......................................................................................................................................................................................................
> ............................................................................................................................................................c4 ......?.:Z3.
.R...x.9..........T.Np(0$%'...@#../q..'!m.t.F2$*J

It looks to me like a fixed format, but I might be wrong.
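
Whether the records really are fixed-width can be checked by tallying line lengths. The following is a minimal sketch in Python, assuming records end in \n or \r\n; record_lengths is a hypothetical helper name, not part of Stata or of this thread. The -hexdump- output quoted further down reports 120 characters for each of the first five lines, so a single dominant length at or near that value would be consistent with a fixed-column layout.

```python
from collections import Counter
from itertools import islice


def record_lengths(path, n_records=1000):
    """Tally the lengths of the first n_records lines, excluding line ends."""
    counts = Counter()
    with open(path, "rb") as fh:
        # Iterating a binary file yields chunks that end at each b"\n".
        for line in islice(fh, n_records):
            counts[len(line.rstrip(b"\r\n"))] += 1
    return counts
```

If nearly all of the sampled lines share one length, the readable part of the file is plausibly in fixed-column format; a wide spread of lengths would suggest otherwise.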

regards,
Barbara

2011/4/24 Nick Cox <njcoxstata@gmail.com>:
> Your last question is, in effect, can I explain to you how to read a 
> binary file with unspecified structure into Stata, and the short 
> answer is sorry, no.
>
> It's a rare word processor that can open large binary files with 
> success. Word processors accept a range of formats for documents, 
> tending to prefer their own proprietary format, but are usually 
> useless at reading binary data files. A good text editor could do it; 
> that does not include the proprietary editors bundled with MS Windows.
>
> I wonder if you are being misled by the first line in the help for
> -infix- below, while overlooking the second line, which is vital.
>
> "infix reads into memory from a disk dataset that is not in Stata 
> format.  infix requires that the data be in fixed-column format."
>
> As you reported, Stata is seeing far fewer end-of-line character pairs 
> \r\n than lines in this file; \r and \n characters are occurring by 
> themselves, which is not standard for text files in MS Windows; and
> -hexdump- is labelling this binary. It's unlikely to be wrong on 
> that.
>
> You could try just
>
> . type filename.txt
>
> in Stata and that might show you, and us, the first few lines of the 
> file. They might be recognisable to someone as in a particular format.
>
> I think if you can't get an idea of what the structure of this file 
> is, then you have no way to read it into Stata. Why a "government 
> organisation" is providing a binary file and calling it .txt I cannot 
> explain. You may need to talk to them.
>
> Nick
>
> 2011/4/24 Barbara Guimarães <barbara.vgh@gmail.com>:
>> Dear Nick, unfortunately, I am not able to open the file with any 
>> word processor (I believe that is because of its size). This 
>> dataset was provided by a government organization, so I received 
>> it already in .txt format and don't have access to the primary data.
>>
>>
>> However, the output of -hexdump, analyze- was:
>>
>>
>> . hexdump TS_QUEST_ALUNO.txt, analyze
>>
>>   Line-end characters                          Line length (tab=1)
>>     \r\n         (Windows)        2,517,361      minimum                         0
>>     \r by itself (Mac)              686,626      maximum                20,971,542
>>     \n by itself (Unix)             768,441
>>   Space/separator characters                   Number of lines           3,972,429
>>     [blank]                     112,067,613      EOL at EOF?                    no
>>     [tab]                           707,187
>>     [comma] (,)                     765,547    Length of first 5 lines
>>   Control characters                             Line 1                        120
>>     binary 0                     30,611,037      Line 2                        120
>>     CTL excl. \r, \n, \t         19,330,367      Line 3                        120
>>     DEL                             367,820      Line 4                        120
>>     Extended (128-159,255)       21,370,596      Line 5                        120
>>   ASCII printable
>>     A-Z                         149,642,323
>>     a-z                          16,234,081    File format                  BINARY
>>     0-9                          53,967,247
>>     Special (!@#$ etc.)          28,963,365
>>     Extended (160-254)           54,882,559
>>                             ---------------
>>   Total                         495,399,531
>>
>>   Observed were:
>>      \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
>>      ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5
>>      6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
>>      Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | }
>>      ~ DEL 128 E^A E^B E^C E^D E^E E^F E^G E^H E^I E^J E^K E^L E^M E^N E^O
>>      E^P E^Q E^R E^S E^T E^U E^V E^W E^X E^Y E^Z 155 156 157 158 159 160 ¡ ¢
>>      £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ
>>      Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê
>>      ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ 255
>>
>>
>> Is there any way I could transform this dataset so that Stata would 
>> read it entirely?
>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

