Why do I get rows of missing data when I use infile?

Title   Malformed end-of-line sequences
Author James Hassell, StataCorp

Sometimes when you use Stata’s infile command, the resulting dataset alternates between rows of missing values and rows of data. For example, consider the following sample dataset, which includes its own dictionary:

 dictionary {
         str18  make              `"Make and Model"'
         int    price             `"Price"'
         int    mpg               `"Mileage (mpg)"'
         int    rep78             `"Repair Record 1978"'
         float  headroom          `"Headroom (in.)"'
         int    trunk             `"Trunk space (cu. ft.)"'
         int    weight            `"Weight (lbs.)"'
 } 
 "AMC Concord"             4099        22         3     2.5        11    2930
 "AMC Pacer"               4749        17         3     3.0        11    3350
 "AMC Spirit"              3799        22         .     3.0        12    2640
 "Buick Century"           4816        20         3     4.5        16    3250
 "Buick Electra"           7827        15         4     4.0        20    4080
 "Buick LeSabre"           5788        18         3     4.0        21    3670

You would probably expect Stata’s infile command to read this file and leave six observations in memory. However, after Stata reads the data, you might be surprised to find the following output:

 . infile using auto.raw, clear
 (output omitted)
          
 . list
     
      +-----------------------------------------------------------------+
      |          make   price   mpg   rep78   headroom   trunk   weight |
      |-----------------------------------------------------------------|
   1. |                     .     .       .          .       .        . |
   2. |   AMC Concord    4099    22       3        2.5      11     2930 |
   3. |                     .     .       .          .       .        . |
   4. |     AMC Pacer    4749    17       3          3      11     3350 |
   5. |                     .     .       .          .       .        . |
      |-----------------------------------------------------------------|
   6. |    AMC Spirit    3799    22       .          3      12     2640 |
   7. |                     .     .       .          .       .        . |
   8. | Buick Century    4816    20       3        4.5      16     3250 |
   9. |                     .     .       .          .       .        . |
  10. | Buick Electra    7827    15       4          4      20     4080 |
      |-----------------------------------------------------------------|
  11. |                     .     .       .          .       .        . |
  12. | Buick LeSabre    5788    18       3          4      21     3670 |
  13. |                     .     .       .          .       .        . |
      +-----------------------------------------------------------------+

At this point, you are probably wondering what has happened. The key to understanding this behavior is knowing which end-of-line (EOL) characters each platform uses. Stata can safely and accurately read raw data that have valid Windows, Mac, or Unix EOL markers. The unexpected behavior in the example above is explained by the malformed EOL sequences in our test file (auto.raw). The valid EOL sequences for all three platforms are listed in the table below:

 Platform    Characters    ASCII codes
 Mac         \n            10
 Unix        \n            10
 Windows     \r\n          13 10

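As mentioned above, the file named auto.raw contains invalid EOL sequences. Here is the EOL sequence found in our test file: "\r\r\n". As you can see, the pattern does not match any of the three valid EOL sequences. The extra "\r" is counted as a line end in its own right, so every line of data is followed by an empty line, which infile reads as an observation of missing values.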
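The effect of the extra "\r" can be illustrated outside Stata. The following is a minimal Python sketch, not Stata's own parsing logic; the three-line sample string is hypothetical. It relies on the fact that Python's str.splitlines() accepts "\r", "\n", and "\r\n" as line boundaries, which resembles the way the malformed file ends up with twice as many lines as expected.

```python
# Hypothetical three-line sample that ends each line with the malformed
# "\r\r\n" sequence described above.
raw = '"AMC Concord" 4099\r\r\n"AMC Pacer" 4749\r\r\n"AMC Spirit" 3799\r\r\n'

# str.splitlines() treats "\r", "\n", and "\r\n" as line boundaries, so the
# first "\r" closes the data line and the remaining "\r\n" closes an empty one.
lines = raw.splitlines()

for number, line in enumerate(lines, start=1):
    print(number, repr(line))
```

Every second element of lines is an empty string, which parallels the alternating rows of missing values in the listing above.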

The solution

Stata has a command called hexdump, which can read and analyze raw binary data. Using hexdump with its analyze option displays some of the normally hidden attributes associated with a text file. For example,

 . hexdump auto.raw, analyze
        
   Line-end characters                        Line length (tab=1)
     \r\n         (Windows)             15      minimum                        1
     \r by itself (Old Mac)             15      maximum                       77
     \n by itself (Mac or Unix)          0
   Space/separator characters                 Number of lines                 30
     [blank]                           466      EOL at EOF?                  yes
     [tab]                               0
     [comma] (,)                         0    Length of first 5 lines
   Control characters                           Line 1                        13
     binary 0                            0      Line 2                         1
     CTL excl. \r, \n, \t                0      Line 3                        52
     DEL                                 0      Line 4                         1
     Extended (128-159,255)              0      Line 5                        43
   ASCII printable
     A-Z                                28
     a-z                               174    File format                  ASCII
     0-9                                97
     Special (!@#$ etc.)                61
     Extended (160-254)                  0
                           ---------------
   Total                               871
      
   Observed were:
      \n \r blank " ' ( ) . 0 1 2 3 4 5 6 7 8 9 A B C E H L M P R S T W ` a b
      c d e f g h i k l m n o p r s t u w y { }
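The three line-end counts at the top of this report can be approximated with a short byte scan. The Python sketch below is an illustration only and is not how hexdump is implemented; the function name count_line_ends, the dictionary keys, and the sample bytes are all hypothetical.

```python
def count_line_ends(data: bytes) -> dict:
    """Count CR LF pairs, lone CR bytes, and lone LF bytes in data."""
    counts = {"crlf": 0, "cr": 0, "lf": 0}
    i = 0
    while i < len(data):
        if data[i:i + 2] == b"\r\n":
            counts["crlf"] += 1   # Windows-style pair
            i += 2
        elif data[i:i + 1] == b"\r":
            counts["cr"] += 1     # "\r by itself"
            i += 1
        elif data[i:i + 1] == b"\n":
            counts["lf"] += 1     # "\n by itself"
            i += 1
        else:
            i += 1                # any other byte
    return counts


# Hypothetical sample with two malformed "\r\r\n" terminators.
sample = b"line one\r\r\nline two\r\r\n"
print(count_line_ends(sample))  # {'crlf': 2, 'cr': 2, 'lf': 0}
```

Equal counts of pairs and lone "\r" bytes, as in the sample, are the same signature that the report shows for auto.raw (15 and 15).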

In the hexdump output for auto.raw, we can see that both Windows ("\r\n") and old Mac ("\r" by itself) EOL characters are present, 15 of each. We can use Stata’s filefilter command to strip out the unwanted old Mac EOL characters (i.e., the first \r in the \r\r\n sequence). For example,

 . filefilter auto.raw auto2.raw, from(\r\r) to(\r) replace
        
 . hexdump auto2.raw, analyze
     
   Line-end characters                        Line length (tab=1)
     \r\n         (Windows)             15      minimum                        2
     \r by itself (Old Mac)              0      maximum                       77
     \n by itself (Mac or Unix)          0
   Space/separator characters                 Number of lines                 15
     [blank]                           466      EOL at EOF?                  yes
     [tab]                               0
     [comma] (,)                         0    Length of first 5 lines
   Control characters                           Line 1                        13
     binary 0                            0      Line 2                        52
     CTL excl. \r, \n, \t                0      Line 3                        43
     DEL                                 0      Line 4                        51
     Extended (128-159,255)              0      Line 5                        56
   ASCII printable
     A-Z                                28
     a-z                               174    File format                  ASCII
     0-9                                97
     Special (!@#$ etc.)                61
     Extended (160-254)                  0
                           ---------------
   Total                               856
     
   Observed were:
      \n \r blank " ' ( ) . 0 1 2 3 4 5 6 7 8 9 A B C E H L M P R S T W ` a b
      c d e f g h i k l m n o p r s t u w y { }

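We can see that all "\r\r" sequences were replaced by "\r", so each line now ends in a single "\r\n". The count of "\r" by itself has dropped from 15 to 0, the number of lines from 30 to 15, and the file is 15 bytes shorter. Now our file contains only valid Windows EOL sequences.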
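The same substitution can be sketched in Python for readers who want to see what happens to the bytes. This is an illustration under stated assumptions, not a description of how filefilter works internally; the function name collapse_double_cr and the two-line sample are hypothetical.

```python
def collapse_double_cr(data: bytes) -> bytes:
    """Replace each non-overlapping "\\r\\r" pair with a single "\\r"."""
    return data.replace(b"\r\r", b"\r")


# Hypothetical two-line sample with the malformed "\r\r\n" terminator.
malformed = b'"AMC Concord" 4099\r\r\n"AMC Pacer" 4749\r\r\n'
repaired = collapse_double_cr(malformed)

# One byte is removed per line, and each line now ends in "\r\n".
print(len(malformed) - len(repaired))
print(repaired.decode("ascii").splitlines())
```

In this Python version, a single pass halves a longer run of "\r" characters rather than reducing it to one, so a file with three or more consecutive "\r" bytes would need the replacement applied repeatedly.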

The new file, named auto2.raw, can now be read into Stata with its accompanying data dictionary by using the infile command. For example,

 . infile using auto2.raw, clear
 (output omitted)
        
 . list
        
      +-----------------------------------------------------------------+
      |          make   price   mpg   rep78   headroom   trunk   weight |
      |-----------------------------------------------------------------|
   1. |   AMC Concord    4099    22       3        2.5      11     2930 |
   2. |     AMC Pacer    4749    17       3          3      11     3350 |
   3. |    AMC Spirit    3799    22       .          3      12     2640 |
   4. | Buick Century    4816    20       3        4.5      16     3250 |
   5. | Buick Electra    7827    15       4          4      20     4080 |
      |-----------------------------------------------------------------|
   6. | Buick LeSabre    5788    18       3          4      21     3670 |
      +-----------------------------------------------------------------+