Why do I get rows of missing data when I use infile?

Title: Malformed end-of-line sequences
Author: James Hassell, StataCorp
Date: December 2003; minor revisions August 2007

Sometimes when you use Stata’s infile command, Stata reads data
that yield alternating missing rows. For example, consider the following
sample dataset, which includes its own dictionary:
dictionary {
    str18 make      `"Make and Model"'
    int   price     `"Price"'
    int   mpg       `"Mileage (mpg)"'
    int   rep78     `"Repair Record 1978"'
    float headroom  `"Headroom (in.)"'
    int   trunk     `"Trunk space (cu. ft.)"'
    int   weight    `"Weight (lbs.)"'
}
"AMC Concord" 4099 22 3 2.5 11 2930
"AMC Pacer" 4749 17 3 3.0 11 3350
"AMC Spirit" 3799 22 . 3.0 12 2640
"Buick Century" 4816 20 3 4.5 16 3250
"Buick Electra" 7827 15 4 4.0 20 4080
"Buick LeSabre" 5788 18 3 4.0 21 3670
You would probably expect that the data in the file could be read with
Stata’s infile command and that, subsequently, the data in
memory would contain 6 rows. However, after Stata reads the data, you might
be surprised to find the following output:
. infile using auto.raw, clear
(output omitted)
. list
     +-----------------------------------------------------------------+
     |          make   price   mpg   rep78   headroom   trunk   weight |
     |-----------------------------------------------------------------|
  1. |                     .     .       .          .       .        . |
  2. |   AMC Concord    4099    22       3        2.5      11     2930 |
  3. |                     .     .       .          .       .        . |
  4. |     AMC Pacer    4749    17       3          3      11     3350 |
  5. |                     .     .       .          .       .        . |
     |-----------------------------------------------------------------|
  6. |    AMC Spirit    3799    22       .          3      12     2640 |
  7. |                     .     .       .          .       .        . |
  8. | Buick Century    4816    20       3        4.5      16     3250 |
  9. |                     .     .       .          .       .        . |
 10. | Buick Electra    7827    15       4          4      20     4080 |
     |-----------------------------------------------------------------|
 11. |                     .     .       .          .       .        . |
 12. | Buick LeSabre    5788    18       3          4      21     3670 |
 13. |                     .     .       .          .       .        . |
     +-----------------------------------------------------------------+
At this point, you are probably wondering what has happened. The key to
understanding this behavior is the end-of-line (EOL) characters used on
various platforms. Stata can safely and accurately read raw data that have
valid Windows, Macintosh, or Unix EOL markers. The unexpected behavior in
the example above is caused by malformed EOL sequences in our test file
(auto.raw). The valid EOL sequences for all three formats are listed in
the table below:
Platform     Characters   ASCII codes
Macintosh    \r           13
Unix         \n           10
Windows      \r\n         13 10
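As mentioned above, the file named auto.raw contained invalid EOL
sequences. Here is the EOL sequence found in our test file:
"\r\r\n". This pattern does not match any of the three valid EOL
sequences. Stata reads the first \r as a complete Macintosh line end and
the following \r\n as a complete Windows line end, so every line of the
file is followed by an empty line, and each empty line in the data section
becomes an observation in which every value is missing.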
The solution
Stata has a command called hexdump, which can read and analyze raw
binary data. Using hexdump with its analyze option displays
some of the normally hidden attributes associated with a text file. For
example,
. hexdump auto.raw, analyze
  Line-end characters                      Line length (tab=1)
    \r\n         (Windows)          15       minimum                    1
    \r by itself (Mac)              15       maximum                   77
    \n by itself (Unix)              0
  Space/separator characters               Number of lines             30
    [blank]                        466       EOL at EOF?              yes
    [tab]                            0
    [comma] (,)                      0     Length of first 5 lines
  Control characters                         Line 1                    13
    binary 0                         0       Line 2                     1
    CTL excl. \r, \n, \t             0       Line 3                    52
    DEL                              0       Line 4                     1
    Extended (128-159,255)           0       Line 5                    43
  ASCII printable
    A-Z                             28
    a-z                            174     File format              ASCII
    0-9                             97
    Special (!@#$ etc.)             61
    Extended (160-254)               0
                       ---------------
  Total                            871
  Observed were:
     \n \r blank " ' ( ) . 0 1 2 3 4 5 6 7 8 9 A B C E H L M P R S T W ` a b
     c d e f g h i k l m n o p r s t u w y { }
The output above shows both Windows and Macintosh EOL characters, 15 of
each, and 30 lines where the file should have 15. We can use Stata’s
filefilter command to strip out the unwanted Macintosh EOL characters
(i.e., the first \r in each \r\r\n sequence). For example,
. filefilter auto.raw auto2.raw, from(\r\r) to(\r) replace
. hexdump auto2.raw, analyze
  Line-end characters                      Line length (tab=1)
    \r\n         (Windows)          15       minimum                    2
    \r by itself (Mac)               0       maximum                   77
    \n by itself (Unix)              0
  Space/separator characters               Number of lines             15
    [blank]                        466       EOL at EOF?              yes
    [tab]                            0
    [comma] (,)                      0     Length of first 5 lines
  Control characters                         Line 1                    13
    binary 0                         0       Line 2                    52
    CTL excl. \r, \n, \t             0       Line 3                    43
    DEL                              0       Line 4                    51
    Extended (128-159,255)           0       Line 5                    56
  ASCII printable
    A-Z                             28
    a-z                            174     File format              ASCII
    0-9                             97
    Special (!@#$ etc.)             61
    Extended (160-254)               0
                       ---------------
  Total                            856
  Observed were:
     \n \r blank " ' ( ) . 0 1 2 3 4 5 6 7 8 9 A B C E H L M P R S T W ` a b
     c d e f g h i k l m n o p r s t u w y { }
We can see that all "\r\r" sequences were replaced by "\r",
which leaves a valid "\r\n" at the end of each line. The file is 15 bytes
shorter (856 rather than 871), one byte per line, and now contains only
valid Windows EOL sequences.
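The same check and repair can be scripted rather than read off the screen. The following is a minimal sketch, not a definitive recipe. It assumes that hexdump with the analyze option leaves its line-end counts in r(Windows), r(Mac), and r(Unix), and that filefilter leaves the number of replacements in r(occurrences); confirm both with return list in your version of Stata. It also uses the narrower pattern \r\r\n instead of \r\r, which should give the same result for this file.

```stata
* Count the line-end characters in the original file.
* Assumption: with analyze, hexdump leaves these counts in r().
hexdump auto.raw, analyze
display "Windows: " r(Windows) "   Mac: " r(Mac) "   Unix: " r(Unix)

* Narrower pattern than from(\r\r): only the malformed \r\r\n is rewritten.
filefilter auto.raw auto2.raw, from(\r\r\n) to(\r\n) replace
display "sequences replaced: " r(occurrences)

* Recheck: the repaired file should contain no \r by itself.
hexdump auto2.raw, analyze
assert r(Mac) == 0
```

The narrower pattern leaves any other pair of consecutive \r characters untouched, which matters only if the file legitimately contains such pairs.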
The new file, named auto2.raw, can now be read into Stata with its
accompanying data dictionary by using the infile command. For
example,
. infile using auto2.raw, clear
(output omitted)
. list
     +-----------------------------------------------------------------+
     |          make   price   mpg   rep78   headroom   trunk   weight |
     |-----------------------------------------------------------------|
  1. |   AMC Concord    4099    22       3        2.5      11     2930 |
  2. |     AMC Pacer    4749    17       3          3      11     3350 |
  3. |    AMC Spirit    3799    22       .          3      12     2640 |
  4. | Buick Century    4816    20       3        4.5      16     3250 |
  5. | Buick Electra    7827    15       4          4      20     4080 |
     |-----------------------------------------------------------------|
  6. | Buick LeSabre    5788    18       3          4      21     3670 |
     +-----------------------------------------------------------------+
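For a larger file, where inspecting the listing by eye is impractical, the result can be checked with a few assertions. This is only a sketch under the assumptions of this example: the expected count of six observations comes from the six data lines in the sample file, and an empty make is taken as the sign of a spurious row.

```stata
* Read the repaired file with its embedded dictionary.
infile using auto2.raw, clear

* The sample file has six data lines, so six observations are expected.
assert _N == 6

* No observation should have an empty make.
count if missing(make)
assert r(N) == 0
```

If either assert fails, Stata stops with an error, which makes the check usable inside a do-file.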