Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: RE: problem reading text data into stata
From: Nick Cox <[email protected]>
To: "'[email protected]'" <[email protected]>
Subject: st: RE: problem reading text data into stata
Date: Mon, 5 Dec 2011 12:45:08 +0000
I just copied this and did a character count with -hexdump-. It seems as if you have some possibly problematic characters there.
In particular,
  Dec  Hex  Char  Frequency
  133  85   E^E           1
  145  91   E^Q           4
  146  92   E^R         461
  150  96   E^V           1
  160  a0   160           1
  225  e1   á             1
Otherwise, as you give no commands and no definition of what would be correct, it is difficult to comment on what you should do. But if this were my problem, I would be looking for those characters with a decent text editor.
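For anyone without a text editor that shows such bytes, here is a minimal sketch in Python (not Stata, and not part of the original exchange) of how those characters could be located. The function names and the sample bytes are made up for illustration; for the real file you would read textdata.txt in binary mode and pass its contents in.

```python
from collections import Counter


def find_extended_bytes(data: bytes):
    """Return (line_number, column, byte_value) for every byte >= 128.

    Lines are split on b"\n" only; column is the 1-based byte offset
    within the line.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, value in enumerate(line, start=1):
            if value >= 128:
                hits.append((line_no, col, value))
    return hits


def tabulate_extended(data: bytes) -> Counter:
    """Count how often each byte >= 128 occurs, like a short -hexdump- table."""
    return Counter(value for value in data if value >= 128)


# Hypothetical sample, not taken from the real file.
sample = b"id\tname\r\n1\tO\x92Brien\r\n2\tM\xe1laga\r\n"

for line_no, col, value in find_extended_bytes(sample):
    print(f"line {line_no}, column {col}: dec {value}, hex {value:02x}")
```

Knowing the line and column of each hit makes it easier to decide, case by case, whether a character should be kept, replaced, or removed.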
. hexdump textdata.txt, tabulate
  Line-end characters                   Line length (tab=1)
    \r\n         (Windows)    19,999      minimum                    1
    \r by itself (Mac)        19,999      maximum                  344
    \n by itself (Unix)            0
  Space/separator characters            Number of lines         39,998
    [blank]                  602,317    EOL at EOF?                yes
    [tab]                    179,991
    [comma] (,)                1,092    Length of first 5 lines
  Control characters                      Line 1                    79
    binary 0                       0      Line 2                     1
    CTL excl. \r, \n, \t           0      Line 3                   126
    DEL                            0      Line 4                     1
    Extended (128-159,255)       467      Line 5                   127
  ASCII printable
    A-Z                      218,837
    a-z                      786,411    File format             BINARY
    0-9                      582,152
    Special (!@#$ etc.)      172,669
    Extended (160-254)             2
                     ---------------
    Total                  2,603,935
Observed were:
\t \n \r blank ! " # $ & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < > ?
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ ] a b c d e f g h
i j k l m n o p q r s t u v w x y z { } E^E E^Q E^R E^V 160 á
Tabulation (character not listed if unobserved):
Dec Hex Char Frequency
------------------------------
009 09 \t 179,991
010 0a \n 19,999
013 0d \r 39,998
032 20 blank 602,317
033 21 ! 33
034 22 " 6,251
035 23 # 6
036 24 $ 2
038 26 & 564
039 27 ' 733
040 28 ( 9,062
041 29 ) 9,054
042 2a * 5
043 2b + 1
044 2c , 1,092
045 2d - 2,486
046 2e . 4,850
047 2f / 51,369
048 30 0 221,441
049 31 1 66,881
050 32 2 85,899
051 33 3 71,445
052 34 4 29,256
053 35 5 41,863
054 36 6 18,235
055 37 7 14,778
056 38 8 14,031
057 39 9 18,323
058 3a : 69,648
059 3b ; 18,494
060 3c < 6
062 3e > 6
063 3f ? 15
065 41 A 11,598
066 42 B 20,285
067 43 C 34,650
068 44 D 3,676
069 45 E 20,423
070 46 F 2,984
071 47 G 3,145
072 48 H 2,164
073 49 I 10,586
074 4a J 1,782
075 4b K 1,202
076 4c L 3,880
077 4d M 6,063
078 4e N 42,371
079 4f O 2,161
080 50 P 8,361
081 51 Q 276
082 52 R 4,663
083 53 S 18,057
084 54 T 6,408
085 55 U 4,726
086 56 V 1,495
087 57 W 5,610
088 58 X 695
089 59 Y 1,218
090 5a Z 358
091 5b [ 2
093 5d ] 4
097 61 a 57,880
098 62 b 4,090
099 63 c 22,449
100 64 d 16,106
101 65 e 103,153
102 66 f 4,041
103 67 g 35,055
104 68 h 13,928
105 69 i 69,629
106 6a j 107
107 6b k 5,893
108 6c l 28,815
109 6d m 45,755
110 6e n 77,699
111 6f o 52,881
112 70 p 30,897
113 71 q 3,446
114 72 r 53,480
115 73 s 46,056
116 74 t 38,425
117 75 u 15,202
118 76 v 25,072
119 77 w 26,044
120 78 x 2,300
121 79 y 6,875
122 7a z 1,133
123 7b { 39
125 7d } 39
133 85 E^E 1
145 91 E^Q 4
146 92 E^R 461
150 96 E^V 1
160 a0 160 1
225 e1 á 1
------------------------------
Total 2,603,935
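One possible way to act on this tabulation, sketched in Python rather than Stata and resting on two assumptions that have not been checked against the file. First, that the file is in Windows-1252, where bytes 145 and 146 are curly single quotes, 150 is an en dash, 133 is an ellipsis and 160 is a no-break space. Second, that records end in \r\r\n, which is only an inference from the counts (39,998 \r against 19,999 \n, with every second line of length 1). The mapping table and function name are illustrative only, and byte 225 (á) is left untouched.

```python
# Assumed Windows-1252 meanings of the bytes reported by -hexdump-.
CP1252_TO_ASCII = {
    0x85: b"...",  # horizontal ellipsis
    0x91: b"'",    # left single quotation mark
    0x92: b"'",    # right single quotation mark
    0x96: b"-",    # en dash
    0xA0: b" ",    # no-break space
}


def clean_bytes(data: bytes) -> bytes:
    """Normalise line endings to \\n and replace the assumed cp1252 bytes."""
    # Assumption: doubled carriage returns belong to a single line ending.
    data = data.replace(b"\r\r\n", b"\n").replace(b"\r\n", b"\n")
    out = bytearray()
    for value in data:
        # Bytes not in the table (including 0xe1) pass through unchanged.
        out += CP1252_TO_ASCII.get(value, bytes([value]))
    return bytes(out)


# Hypothetical sample, not taken from the real file.
print(clean_bytes(b"a\x92b\r\r\nc\x96d\xa0e\r\n"))
```

Writing the cleaned bytes to a new file, rather than overwriting the original, keeps the raw data available if either assumption turns out to be wrong.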
Nick
[email protected]
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of David Stromberg
Sent: 05 December 2011 11:18
To: [email protected]
Subject: st: problem reading text data into stata
On a number of occasions, I have had problems reading tab-delimited
text (string) data into Stata. For example, I cannot get Stata to
correctly read the tab-separated text file at
http://people.su.se/~dstro/textdata.txt
I tried opening it in Excel and resaving, saving as CSV, identifying and
eliminating characters that make Stata misread, etc. Either some lines
are missing, or some text is incorrect, e.g. text within parentheses.
Any ideas on how to proceed?