



Re: st: Problem with infix: record too long


From   Daniel Marcelino <dmsilva.br@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Problem with infix: record too long
Date   Tue, 26 Apr 2011 03:17:05 -0300

Maybe you could try opening your file with software like TextMate. I
have used it to open every kind of text file, including files as large
as 500 MB.
So, it's worth a try.

Daniel
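
An editor is one way to look inside a large file; another is to read only its first and last bytes. What follows is a minimal Python sketch and not part of the original exchange: `peek` is a hypothetical helper name, and the 300-byte window is an arbitrary assumption.

```python
import os


def peek(path, n=300):
    """Return the first and last n bytes of a file without reading all of it."""
    size = os.path.getsize(path)
    # Binary mode: no character decoding and no newline translation.
    with open(path, "rb") as fh:
        head = fh.read(n)
        # Jump close to the end instead of scanning the whole file.
        fh.seek(max(size - n, 0))
        tail = fh.read(n)
    return head, tail
```

Calling `peek("TS_QUEST_ALUNO.txt")` and printing `repr(head)` and `repr(tail)` shows unprintable bytes as escapes such as `\x00`, which makes it possible to tell them apart from literal periods.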

On Mon, Apr 25, 2011 at 9:11 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> Fixed format or not, I can't see a way for Stata to make sense out of that.
>
> It's not uncommon for datafiles to start with some kind of preamble.
> But this seems to start with some data. Also, the end looks quite
> unlike the beginning, as might be guessed from the -hexdump- report.
>
> Unless you can give more information on what should be inside --
> you've not said, but you should know -- or someone recognises this
> stuff, I think you need to ask those people what kind of beast they
> sent.
>
> 2011/4/26 Barbara Guimarães <barbara.vgh@gmail.com>:
>> Nick, thanks for your response.
>>
>> Using the type filename.txt as you suggested, Stata showed me the
>> following first lines:
>>
>>  type TS_QUEST_ALUNO.txt
>>  1373262421RN24GROSSOS
>> 2404408ADCDBAAACABDCABCEAAAAAAAAAAAAAAC*CBAABAAAAAA
>>  1373263421RN24GROSSOS
>> 2404408BDKEAAAABABDCADBDACAAAB.AAAAAAAABBBBACAAAAAA
>>  1373264421RN24GROSSOS
>> 2404408BAAACBAAB..DCADCDAAAAAABBAAAAAAAAABCAAAA.A..
>>
>> and then ended as:
>>
>>> ......................................................................................................................................................................................................
>>> ............................................................................................................................................................c4 ......?.:Z3.
>> .R...x.9..........T.Np(0$%'...@#../q..'!m.t.F2$*J
>>
>> It looks to me like a fixed format. But I might be wrong.
>>
>> regards,
>> Barbara
>>
>> 2011/4/24 Nick Cox <njcoxstata@gmail.com>:
>>> Your last question is, in effect, can I explain to you how to read a
>>> binary file with unspecified structure into Stata, and the short
>>> answer is sorry, no.
>>>
>>> It's a rare word processor that can open large binary files with
>>> success. Word processors accept a range of formats for documents,
>>> tending to prefer their own proprietary format, but are usually
>>> useless at reading binary data files. A good text editor could do it;
>>> that does not include the proprietary editors bundled with MS Windows.
>>>
>>> I wonder if you are being misled by the first line in the help for
>>> -infix- below, while overlooking the second line, which is vital.
>>>
>>> "infix reads into memory from a disk dataset that is not in Stata
>>> format.  infix requires
>>>    that the data be in fixed-column format."
>>>
>>> As you reported, Stata is seeing far fewer end-of-line character pairs
>>> \r\n than lines in this file; \r and \n characters are occurring by
>>> themselves, which is not standard for text files in MS Windows; and
>>> -hexdump- is labelling this binary. It's unlikely to be wrong on
>>> that.
>>>
>>> You could try just
>>>
>>> . type filename.txt
>>>
>>> in Stata and that might show you, and us, the first few lines of the
>>> file. They might be recognisable to someone as in a particular format.
>>>
>>> I think if you can't get an idea of what the structure of this file
>>> is, then you have no way to read it into Stata. Why a "government
>>> organisation" is providing a binary file and calling it a .txt I cannot
>>> explain. You may need to talk to them.
>>>
>>> Nick
>>>
>>> 2011/4/24 Barbara Guimarães <barbara.vgh@gmail.com>:
>>>> Dear Nick, unfortunately, I have not been able to open the file with
>>>> any word processor (I believe because of its size). This dataset was
>>>> provided by a government organization, so I received it already in
>>>> .txt format and don't have access to the primary data.
>>>>
>>>> However, the output of -hexdump, analyze- was:
>>>>
>>>>
>>>> . hexdump TS_QUEST_ALUNO.txt, analyze
>>>>
>>>>   Line-end characters                        Line length (tab=1)
>>>>     \r\n         (Windows)       2,517,361     minimum                        0
>>>>     \r by itself (Mac)             686,626     maximum               20,971,542
>>>>     \n by itself (Unix)            768,441
>>>>   Space/separator characters                 Number of lines          3,972,429
>>>>     [blank]                    112,067,613   EOL at EOF?                     no
>>>>     [tab]                          707,187
>>>>     [comma] (,)                    765,547   Length of first 5 lines
>>>>   Control characters                           Line 1                       120
>>>>     binary 0                    30,611,037     Line 2                       120
>>>>     CTL excl. \r, \n, \t        19,330,367     Line 3                       120
>>>>     DEL                            367,820     Line 4                       120
>>>>     Extended (128-159,255)      21,370,596     Line 5                       120
>>>>   ASCII printable
>>>>     A-Z                        149,642,323
>>>>     a-z                         16,234,081   File format                 BINARY
>>>>     0-9                         53,967,247
>>>>     Special (!@#$ etc.)         28,963,365
>>>>     Extended (160-254)          54,882,559
>>>>                            ---------------
>>>>   Total                        495,399,531
>>>>
>>>>
>>>>
>>>>  Observed were:
>>>>
>>>>     \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
>>>>     ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5
>>>>     6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
>>>>     Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | }
>>>>     ~ DEL 128 E^A E^B E^C E^D E^E E^F E^G E^H E^I E^J E^K E^L E^M E^N E^O
>>>>     E^P E^Q E^R E^S E^T E^U E^V E^W E^X E^Y E^Z 155 156 157 158 159 160 ¡ ¢
>>>>     £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ
>>>>     Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê
>>>>     ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ 255
>>>>
>>>>
>>>> Is there any way I could transform this dataset so that Stata would
>>>> read it entirely?
>>>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
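
The line-end tallies in the quoted -hexdump- report can be approximated outside Stata, and the same pass can look for the point where the 120-character records stop. The Python sketch below is illustrative only: `tally_line_ends` and `first_irregular_line` are hypothetical names, the counting rules may differ in detail from those of -hexdump-, and it is an assumption that the reported length of 120 excludes the line-end characters.

```python
def tally_line_ends(path, chunk_size=1 << 20):
    """Count CRLF pairs, lone CR, lone LF and NUL bytes, reading in chunks."""
    cr = lf = crlf = nul = 0
    prev_ended_with_cr = False
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            cr += chunk.count(b"\r")
            lf += chunk.count(b"\n")
            nul += chunk.count(b"\x00")
            crlf += chunk.count(b"\r\n")
            # A CRLF pair may straddle two chunks; count it here.
            if prev_ended_with_cr and chunk.startswith(b"\n"):
                crlf += 1
            prev_ended_with_cr = chunk.endswith(b"\r")
    # Every CRLF pair uses up one CR and one LF; the rest stand alone.
    return {"crlf": crlf, "lone_cr": cr - crlf, "lone_lf": lf - crlf, "nul": nul}


def first_irregular_line(path, expected=120):
    """Return (line number, byte offset) of the first LF-terminated line whose
    length, after trailing CR/LF bytes are stripped, differs from `expected`.
    Return None if every line matches."""
    offset = 0
    with open(path, "rb") as fh:
        for number, line in enumerate(fh, start=1):
            if len(line.rstrip(b"\r\n")) != expected:
                return number, offset
            offset += len(line)
    return None
```

If the file really begins with fixed-format text and turns into something else later, the byte offset returned by `first_irregular_line` would indicate where to cut the file before trying -infix- on the leading part. Whether that leading part is all the data that was intended is a question only the provider can answer.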



-- 
Daniel Marcelino
http://danielmarcelino.zip.net
Skype: dmsilva.br


