



Re: st: Problem with infix: record too long


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Problem with infix: record too long
Date   Tue, 26 Apr 2011 01:11:32 +0100

Fixed format or not, I can't see a way for Stata to make sense out of that.

It's not uncommon for datafiles to start with some kind of preamble.
But this seems to start with some data. Also, the end looks quite
unlike the beginning, as might be guessed from the -hexdump- report.

Unless you can give more information on what should be inside --
you've not said, but you should know -- or someone recognises this
stuff, I think you need to ask those people what kind of beast they
sent.
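
For anyone who wants to check the beginning and end of such a file outside Stata, here is a minimal sketch in Python. The function names line_end_report and head_and_tail are made up for illustration and are not part of Stata or any library. The first counts the same line-end cases that -hexdump, analyze- tabulates; the second pulls the first and last few bytes of a file so they can be compared. The "looks binary" rule (any NUL or other control byte) is a crude assumption for illustration, not the rule Stata itself applies.

```python
def line_end_report(data):
    """Count line-end sequences and control bytes in a chunk of raw bytes."""
    crlf = data.count(b"\r\n")
    # Every \r\n pair contains one \r and one \n, so subtract the pairs
    # to get the characters that occur by themselves.
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    nul = data.count(b"\x00")
    # Control bytes 1-31, leaving out tab (9), \n (10) and \r (13).
    other_ctl = sum(1 for b in data if 0 < b < 32 and b not in (9, 10, 13))
    return {
        "crlf": crlf,
        "lone_cr": lone_cr,
        "lone_lf": lone_lf,
        "nul": nul,
        "other_ctl": other_ctl,
        # Assumption: treat any NUL or other control byte as a sign of binary.
        "looks_binary": nul > 0 or other_ctl > 0,
    }


def head_and_tail(path, n=240):
    """Return the first n and last n bytes of a file without reading it all."""
    with open(path, "rb") as fh:
        head = fh.read(n)
        fh.seek(0, 2)              # jump to the end to learn the file size
        size = fh.tell()
        fh.seek(max(size - n, 0))  # back up n bytes, or to the start if shorter
        tail = fh.read(n)
    return head, tail


if __name__ == "__main__":
    # Invented sample bytes: a text-like start followed by a binary-like end.
    sample = b"1373262421RN24GROSSOS\r\n2404408ADCDB\r\n" + b"\x00\x01c4\r.R\n"
    print(line_end_report(sample))
```

If the report for the head looks like clean text while the report for the tail does not, that supports the guess that the file is not one fixed-format text file from start to finish.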

2011/4/26 Barbara Guimarães <barbara.vgh@gmail.com>:
> Nick, thanks for your response.
>
> Using the type filename.txt as you suggested, Stata showed me the
> following first lines:
>
>  type TS_QUEST_ALUNO.txt
>  1373262421RN24GROSSOS
> 2404408ADCDBAAACABDCABCEAAAAAAAAAAAAAAC*CBAABAAAAAA
>  1373263421RN24GROSSOS
> 2404408BDKEAAAABABDCADBDACAAAB.AAAAAAAABBBBACAAAAAA
>  1373264421RN24GROSSOS
> 2404408BAAACBAAB..DCADCDAAAAAABBAAAAAAAAABCAAAA.A..
>
> and then ended as:
>
>> ......................................................................................................................................................................................................
>> ............................................................................................................................................................c4 ......?.:Z3.
> .R...x.9..........T.Np(0$%'...@#../q..'!m.t.F2$*J
>
> It looks to me like a fixed format, but I might be wrong.
>
> regards,
> Barbara
>
> 2011/4/24 Nick Cox <njcoxstata@gmail.com>:
>> Your last question is, in effect, can I explain to you how to read a
>> binary file with unspecified structure into Stata, and the short
>> answer is sorry, no.
>>
>> It's a rare word processor that can open large binary files with
>> success. Word processors accept a range of formats for documents,
>> tending to prefer their own proprietary format, but are usually
>> useless at reading binary data files. A good text editor could do it;
>> that does not include the proprietary editors bundled with MS Windows.
>>
>> I wonder if you are being misled by the first line in the help for
>> -infix- below, while overlooking the second line, which is vital.
>>
>> "infix reads into memory from a disk dataset that is not in Stata
>> format.  infix requires that the data be in fixed-column format."
>>
>> As you reported, Stata is seeing far fewer end-of-line character pairs
>> \r\n than lines in this file; \r and \n characters are occurring by
>> themselves, which is not standard for text files in MS Windows; and
>> -hexdump- is labelling this binary. It's unlikely to be wrong on
>> that.
>>
>> You could try just
>>
>> . type filename.txt
>>
>> in Stata and that might show you, and us, the first few lines of the
>> file. They might be recognisable to someone as in a particular format.
>>
>> I think if you can't get an idea of what the structure of this file
>> is, then you have no way to read it into Stata. Why a "government
>> organisation" is providing a binary file and calling it .txt I cannot
>> explain. You may need to talk to them.
>>
>> Nick
>>
>> 2011/4/24 Barbara Guimarães <barbara.vgh@gmail.com>:
>>> Dear Nick, unfortunately, I am not able to open the file with any
>>> word processor (I believe that is because of its size). This
>>> dataset was provided by a government organization, so I received it
>>> already in .txt format and don't have access to the primary data.
>>>
>>>
>>> However, the output of hexdump with the analyze option was:
>>>
>>>
>>> . hexdump TS_QUEST_ALUNO.txt, analyze
>>>
>>>   Line-end characters                        Line length (tab=1)
>>>     \r\n         (Windows)       2,517,361     minimum                     0
>>>     \r by itself (Mac)             686,626     maximum            20,971,542
>>>     \n by itself (Unix)            768,441
>>>   Space/separator characters                 Number of lines       3,972,429
>>>     [blank]                    112,067,613     EOL at EOF?                no
>>>     [tab]                          707,187
>>>     [comma] (,)                    765,547   Length of first 5 lines
>>>   Control characters                           Line 1                    120
>>>     binary 0                    30,611,037     Line 2                    120
>>>     CTL excl. \r, \n, \t        19,330,367     Line 3                    120
>>>     DEL                            367,820     Line 4                    120
>>>     Extended (128-159,255)      21,370,596     Line 5                    120
>>>   ASCII printable
>>>     A-Z                        149,642,323
>>>     a-z                         16,234,081   File format              BINARY
>>>     0-9                         53,967,247
>>>     Special (!@#$ etc.)         28,963,365
>>>     Extended (160-254)          54,882,559
>>>                            ---------------
>>>   Total                        495,399,531
>>>
>>>
>>>
>>>
>>>  Observed were:
>>>
>>>     \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
>>>
>>>     ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5
>>>
>>>     6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
>>>
>>>     Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | }
>>>
>>>     ~ DEL 128 E^A E^B E^C E^D E^E E^F E^G E^H E^I E^J E^K E^L E^M E^N E^O
>>>
>>>     E^P E^Q E^R E^S E^T E^U E^V E^W E^X E^Y E^Z 155 156 157 158 159 160 ¡ ¢
>>>
>>>     £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ
>>>
>>>     Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê
>>>
>>>     ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ 255
>>>
>>>
>>> Is there any way I could transform this dataset so that Stata would
>>> read it entirely?
>>>
