



Re: st: Problem with infix: record too long


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Problem with infix: record too long
Date   Mon, 25 Apr 2011 01:41:09 +0100

Your last question is, in effect, can I explain to you how to read a
binary file with unspecified structure into Stata, and the short
answer is sorry, no.

It's a rare word processor that can open large binary files with
success. Word processors accept a range of formats for documents,
tending to prefer their own proprietary format, but are usually
useless at reading binary data files. A good text editor could do it;
that does not include the proprietary editors bundled with MS Windows.

I wonder if you are being misled by the first line in the help for
-infix- below, while overlooking the second line, which is vital.

"infix reads into memory from a disk dataset that is not in Stata format.
infix requires that the data be in fixed-column format."

As you reported, Stata is seeing far fewer \r\n end-of-line pairs than
lines in this file; \r and \n characters are occurring by themselves,
which is not standard for text files in MS Windows; and -hexdump- is
labelling the file as binary. It's unlikely to be wrong on that.
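
For anyone wanting to reproduce the line-end counts outside Stata, here is a minimal sketch in Python. The function name count_line_ends and the sample size are my own choices for illustration, not anything from Stata; it only imitates the "Line-end characters" part of the -hexdump, analyze- report.

```python
def count_line_ends(data: bytes) -> dict:
    # Count CR LF pairs, then subtract them from the raw CR and LF totals
    # so that only the characters occurring by themselves remain.
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,                           # Windows line ends
        "lone_cr": data.count(b"\r") - crlf,    # old Mac style
        "lone_lf": data.count(b"\n") - crlf,    # Unix style
        "nul": data.count(b"\0"),               # binary zeros
    }


# Hypothetical usage on the first megabyte of the file from this thread:
# with open("TS_QUEST_ALUNO.txt", "rb") as fh:
#     print(count_line_ends(fh.read(1_000_000)))
```

A plain text file from MS Windows should show lone_cr, lone_lf and nul all at or near zero. If the sample happens to end between a \r and its \n, one count can be off by one.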

You could try just

. type filename.txt

in Stata and that might show you, and us, the first few lines of the
file. They might be recognisable to someone as in a particular format.
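
If -type- floods the screen with unprintable characters, the first bytes can also be looked at outside Stata. What follows is only a sketch in Python, under the assumption that a hex view of the start of the file is enough to recognise a format; the function name peek and the 256-byte sample are hypothetical choices of mine.

```python
def peek(data: bytes, width: int = 16) -> str:
    # Render bytes as rows of: offset, hex codes, printable ASCII.
    # Anything outside the printable ASCII range is shown as a dot.
    pad = width * 3 - 1  # width of a full row of hex codes
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(rows)


# Hypothetical usage on the file from this thread:
# with open("TS_QUEST_ALUNO.txt", "rb") as fh:
#     print(peek(fh.read(256)))
```

Many binary formats start with a short signature, so the first row or two is often the part worth posting.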

I think if you can't get an idea of what the structure of this file
is, then you have no way to read it into Stata. Why a "government
organisation" is providing a binary file and calling it .txt I cannot
explain. You may need to talk to them.

Nick

2011/4/24 Barbara Guimarães <[email protected]>:
> Dear Nick, unfortunately, I am not able to open the file with any
> word processor (I believe that is because of its size). This
> dataset was provided by a government organization, so I received it
> already in .txt format and don't have access to the primary data.
>
> However, the output of -hexdump, analyze- was:
>
>
> . hexdump TS_QUEST_ALUNO.txt, analyze
>
>   Line-end characters                       Line length (tab=1)
>     \r\n         (Windows)      2,517,361     minimum                  0
>     \r by itself (Mac)            686,626     maximum         20,971,542
>     \n by itself (Unix)           768,441
>   Space/separator characters                Number of lines    3,972,429
>     [blank]                   112,067,613   EOL at EOF?               no
>     [tab]                         707,187
>     [comma] (,)                   765,547   Length of first 5 lines
>   Control characters                          Line 1                 120
>     binary 0                   30,611,037     Line 2                 120
>     CTL excl. \r, \n, \t       19,330,367     Line 3                 120
>     DEL                           367,820     Line 4                 120
>     Extended (128-159,255)     21,370,596     Line 5                 120
>   ASCII printable
>     A-Z                       149,642,323
>     a-z                        16,234,081   File format             BINARY
>     0-9                        53,967,247
>     Special (!@#$ etc.)        28,963,365
>     Extended (160-254)         54,882,559
>                           ---------------
>   Total                       495,399,531
>
>  Observed were:
>
>     \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
>
>     ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5
>
>     6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
>
>     Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | }
>
>     ~ DEL 128 E^A E^B E^C E^D E^E E^F E^G E^H E^I E^J E^K E^L E^M E^N E^O
>
>     E^P E^Q E^R E^S E^T E^U E^V E^W E^X E^Y E^Z 155 156 157 158 159 160 ¡ ¢
>
>     £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ
>
>     Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê
>
>     ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ 255
>
>
> Is there any way I could transform this dataset so that Stata would
> read it in its entirety?
>


