
### Why do I get rows of missing data when I use infile?

Title: Malformed end-of-line sequences
Author: James Hassell, StataCorp
Date: December 2003; minor revisions August 2007

Sometimes when you read a raw data file with Stata’s infile command, every other observation in the resulting dataset consists entirely of missing values. For example, consider the following sample dataset, which includes its own dictionary:

dictionary {
        str18  make              `"Make and Model"'
        int    price             `"Price"'
        int    mpg               `"Mileage (mpg)"'
        int    rep78             `"Repair Record 1978"'
        float  headroom          `"Headroom (in.)"'
        int    trunk             `"Trunk space (cu. ft.)"'
        int    weight            `"Weight (lbs.)"'
}
"AMC Concord"             4099        22         3     2.5        11    2930
"AMC Pacer"               4749        17         3     3.0        11    3350
"AMC Spirit"              3799        22         .     3.0        12    2640
"Buick Century"           4816        20         3     4.5        16    3250
"Buick Electra"           7827        15         4     4.0        20    4080
"Buick LeSabre"           5788        18         3     4.0        21    3670


You would expect Stata’s infile command to read this file and leave six observations in memory. However, after Stata reads the data, you might be surprised to find the following output:

 . infile using auto.raw, clear
(output omitted)

. list

     +-----------------------------------------------------------------+
     |          make   price   mpg   rep78   headroom   trunk   weight |
     |-----------------------------------------------------------------|
  1. |                     .     .       .          .       .        . |
  2. |   AMC Concord    4099    22       3        2.5      11     2930 |
  3. |                     .     .       .          .       .        . |
  4. |     AMC Pacer    4749    17       3          3      11     3350 |
  5. |                     .     .       .          .       .        . |
     |-----------------------------------------------------------------|
  6. |    AMC Spirit    3799    22       .          3      12     2640 |
  7. |                     .     .       .          .       .        . |
  8. | Buick Century    4816    20       3        4.5      16     3250 |
  9. |                     .     .       .          .       .        . |
 10. | Buick Electra    7827    15       4          4      20     4080 |
     |-----------------------------------------------------------------|
 11. |                     .     .       .          .       .        . |
 12. | Buick LeSabre    5788    18       3          4      21     3670 |
 13. |                     .     .       .          .       .        . |
     +-----------------------------------------------------------------+


The explanation lies in the end-of-line (EOL) characters that different platforms use. Stata can safely and accurately read raw data that have valid Windows, Macintosh, or Unix EOL markers. Our test file (auto.raw) contains malformed EOL sequences, and those cause the unexpected behavior in the example above. The valid EOL sequences for all three formats are listed in the table below:

| Platform  | Characters | ASCII codes |
|-----------|------------|-------------|
| Macintosh | \r         | 13          |
| Unix      | \n         | 10          |
| Windows   | \r\n       | 13 10       |

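In auto.raw, every line ends with the sequence "\r\r\n", which matches none of the three valid EOL sequences. In effect, Stata reads the first "\r" as a Macintosh EOL and the following "\r\n" as a Windows EOL, so each line of the file is followed by an empty line, and infile stores each empty line as an observation of missing values.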

### The solution

Stata has a command called hexdump, which can read and analyze raw binary data. Using hexdump with its analyze option displays some of the normally hidden attributes associated with a text file. For example,

 . hexdump auto.raw, analyze

Line-end characters                        Line length (tab=1)
\r\n         (Windows)             15      minimum                        1
\r by itself (Mac)                 15      maximum                       77
\n by itself (Unix)                 0
Space/separator characters                 Number of lines                 30
[blank]                           466      EOL at EOF?                  yes
[tab]                               0
[comma] (,)                         0    Length of first 5 lines
Control characters                           Line 1                        13
binary 0                            0      Line 2                         1
CTL excl. \r, \n, \t                0      Line 3                        52
DEL                                 0      Line 4                         1
Extended (128-159,255)              0      Line 5                        43
ASCII printable
A-Z                                28
a-z                               174    File format                  ASCII
0-9                                97
Special (!@#$ etc.)                61
Extended (160-254)                  0
---------------
Total                               871

Observed were:
\n \r blank " ' ( ) . 0 1 2 3 4 5 6 7 8 9 A B C E H L M P R S T W ` a b
c d e f g h i k l m n o p r s t u w y { }

In the output above, we can see that there are both Windows and Macintosh EOL characters present. We can use Stata’s filefilter command to strip out the unwanted Macintosh EOL characters (i.e., the first \r in the \r\r\n sequence). For example,

 . filefilter auto.raw auto2.raw, from(\r\r) to(\r) replace

. hexdump auto2.raw, analyze

Line-end characters                        Line length (tab=1)
\r\n         (Windows)             15      minimum                        2
\r by itself (Mac)                  0      maximum                       77
\n by itself (Unix)                 0
Space/separator characters                 Number of lines                 15
[blank]                           466      EOL at EOF?                  yes
[tab]                               0
[comma] (,)                         0    Length of first 5 lines
Control characters                           Line 1                        13
binary 0                            0      Line 2                        52
CTL excl. \r, \n, \t                0      Line 3                        43
DEL                                 0      Line 4                        51
Extended (128-159,255)              0      Line 5                        56
ASCII printable
A-Z                                28
a-z                               174    File format                  ASCII
0-9                                97
Special (!@#$ etc.)                61
Extended (160-254)                  0
---------------
Total                               856

Observed were:
\n \r blank " ' ( ) . 0 1 2 3 4 5 6 7 8 9 A B C E H L M P R S T W ` a b
c d e f g h i k l m n o p r s t u w y { }


Every "\r\r" sequence was replaced by "\r", so each "\r\r\n" became "\r\n": the count of Macintosh EOL characters drops from 15 to 0, and the number of lines drops from 30 to 15. The file now contains only valid Windows EOL sequences.
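The check and the repair can also be scripted. The following is a minimal do-file sketch, not a definitive implementation; it assumes that hexdump, analyze leaves its line-end counts in r(Windows) and r(Mac) and that filefilter leaves the number of replaced patterns in r(occurrences). The file names are the ones used in this FAQ.

```stata
* Analyze the original file; the counts are left behind in r()
hexdump auto.raw, analyze

* Assumed stored results: r(Windows) counts \r\n, r(Mac) counts \r by itself
if r(Windows) > 0 & r(Mac) > 0 {
    * Collapse each \r\r into a single \r, so that \r\r\n becomes \r\n
    filefilter auto.raw auto2.raw, from(\r\r) to(\r) replace
    display "patterns replaced: " r(occurrences)

    * Recheck the repaired file: no lone \r should remain
    hexdump auto2.raw, analyze
    assert r(Mac) == 0
}
```

The filter runs only when both kinds of line end are present, so a file that already has clean Windows or Unix line ends is left alone and no auto2.raw is written for it.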

The new file, named auto2.raw, can now be read into Stata with its accompanying data dictionary by using the infile command. For example,

 . infile using auto2.raw, clear
(output omitted)

. list

     +-----------------------------------------------------------------+
     |          make   price   mpg   rep78   headroom   trunk   weight |
     |-----------------------------------------------------------------|
  1. |   AMC Concord    4099    22       3        2.5      11     2930 |
  2. |     AMC Pacer    4749    17       3          3      11     3350 |
  3. |    AMC Spirit    3799    22       .          3      12     2640 |
  4. | Buick Century    4816    20       3        4.5      16     3250 |
  5. | Buick Electra    7827    15       4          4      20     4080 |
     |-----------------------------------------------------------------|
  6. | Buick LeSabre    5788    18       3          4      21     3670 |
     +-----------------------------------------------------------------+
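If the import is part of a do-file, a quick sanity check can catch the problem without inspecting the listing by eye. This is a sketch that assumes the six-line sample file used in this FAQ; the expected count and the variable name make come from that example and would need adjusting for other data.

```stata
infile using auto2.raw, clear

* The sample file has six data lines, so six observations are expected
assert _N == 6

* make is a string variable; an all-missing row would leave it empty
assert make != ""
```

Both assertions fail on the data read from the original auto.raw, which produced 13 observations with an empty make in every other row.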