How do I use infile to read in fixed-format data?

Title   Reading fixed-format data with infile
Author James Hardin, StataCorp

You need to use a dictionary to read in fixed-format data. Creating a dictionary can be confusing if you get caught up in all the gory details. We can offer some advice that will handle most of the text files that you encounter. Some other special cases will be addressed in the later examples.

The best advice to solve almost all the infile problems that you encounter when reading fixed-format files is to do the following:

  • Enter one line in the dictionary for each variable that you will read.
  • For each variable that you read in, stipulate the following four settings:
    • starting column, for example, _column(1)
    • storage type, for example, byte, int, or str5
    • variable name, for example, b1
    • read format, for example, %1f or %5s

Note: The most frequent cause of confusion for users is that the “storage type” has nothing to do with how a field is read from the text file. It only affects how that field is stored in Stata after it is read from the text file. To control how the field is read from the text file, use the “read format”.

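For example, specifying that a variable should be stored as type str25 does not mean that Stata should read 25 columns of information when it processes the text file. The number of columns read is set by the read format; to read 25 columns into that variable, specify %25s.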
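As a minimal sketch of the distinction between storage type and read format, the dictionary below stores a string as str25 but reads only 10 columns from the text file. The file name names.raw and the variable names are hypothetical and used only for illustration.

```stata
dictionary using names.raw {
  * names.raw, name, and score are hypothetical
  * stored as str25, but only columns 1-10 are read because the read format is %10s
  _column(1)     str25    name   %10s
  * stored as float, but only columns 11-14 are read because the read format is %4f
  _column(11)    float    score  %4f
}
```

Changing str25 to str10 would change only how the variable is stored in Stata; changing %10s to %25s would change how many columns are read.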

Example 1

To read the data from the test1.raw file below,

        1101
        0111
        1100

you can use the following dictionary in the file test1a.dct:

        dictionary using test1.raw {
          _column(1)     byte     b1     %1f
          _column(2)     byte     b2     %1f
          _column(3)     byte     b3     %1f
          _column(4)     byte     b4     %1f
        }

You could also use the dictionary in file test1b.dct:

        dictionary using test1.raw {
          _column(2)     byte     b2     %1f
          _column(1)     byte     b1     %1f
          _column(4)     byte     b4     %1f
          _column(3)     byte     b3     %1f
        }

The only difference between these two approaches is the order in which the variables are stored in Stata.

 . clear

 . quietly infile using test1a

 . list

      +-------------------+
      | b1   b2   b3   b4 |
      |-------------------|
   1. |  1    1    0    1 |
   2. |  0    1    1    1 |
   3. |  1    1    0    0 |
      +-------------------+

 . clear

 . quietly infile using test1b

 . list

      +-------------------+
      | b2   b1   b4   b3 |
      |-------------------|
   1. |  1    1    1    0 |
   2. |  1    0    1    1 |
   3. |  1    1    0    0 |
      +-------------------+

This example also shows that you can access the columns of the text file in any order.
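Because each line of the dictionary states its own starting column, it should also be possible to read only some of the columns and ignore the rest. The following sketch, which we assume is saved in a hypothetical file test1c.dct, reads just the first and fourth columns of test1.raw.

```stata
dictionary using test1.raw {
  * test1c.dct is a hypothetical file name
  * columns 2 and 3 are never referenced, so they are not read
  _column(1)     byte     b1     %1f
  _column(4)     byte     b4     %1f
}
```

After infile using test1c, the dataset would contain only b1 and b4.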

Example 2

Now let us say that you want to read in the data from the file test2.raw:

        C1245A101George Costanza
        B1223B011Cosmo Kramer

For this file, the documentation from the person who supplied the data tells us that

  • A unique code called id is in the first 5 columns.
  • A 4-digit call number is part of the id in columns 2–5.
  • A letter denoting a city code is in column 6.
  • A 3-digit code denoting a neighborhood code is in columns 7–9.
  • A name appears in columns 10–25.

So, we can prepare a dictionary like this:

        dictionary using test2.raw {
          _column(1)     str5     code   %5s
          _column(2)     int      call   %4f
          _column(6)     str1     city   %1s
          _column(7)     int      neigh  %3f
          _column(10)    str16    name   %16s
        }

This example shows that you can reread columns, placing their contents into different variables: columns 2–5 are read once as part of code and again as call. Although the data look much more complicated in this example, our approach of always specifying the same four settings makes the dictionary easy to read and easy to match against the documentation that came with the data.

 . clear

 . quietly infile using test2

 . list

      +-----------------------------------------------+
      |  code   call   city   neigh              name |
      |-----------------------------------------------|
   1. | C1245   1245      A     101   George Costanza |
   2. | B1223   1223      B      11      Cosmo Kramer |
      +-----------------------------------------------+
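Because columns 2–5 are read twice, once as part of code and once as call, the two variables can be checked against each other. The commands below are a sketch of such checks, assuming the data were read with the dictionary above; they are not part of the original example.

```stata
* code is documented as unique, so it should identify the observations
isid code

* characters 2-5 of code should be the same number that was read into call
assert call == real(substr(code, 2, 4))
```

If either command stops with an error, the dictionary or the documentation deserves a second look.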

Example 3

Here we introduce records that span more than one line in the text file. The only additional responsibility that we have when we make our dictionary is to specify the points at which the record continues on a new line. Consider the data in test3.raw:

        Jonathan Swift
        12345 South Mockingbird
        Detroit, Michigan
        1010111
        e e cummings
        4123 Elm
        Buffalo, New York
        1101210

The documentation that accompanied this file tells us that

  • The name appears on the first line and is at most 15 characters long.
  • The address appears on the second line and is at most 30 characters long.
  • The city appears on the third line and is at most 20 characters long.
  • Seven yes/no questions appear on the fourth line, where yes=1 and no=0.

For these data, we can prepare the dictionary:

        dictionary using test3.raw {
          _column(1)     str15    name   %15s
        _newline
          _column(1)     str30    addr   %30s
        _newline
          _column(1)     str20    city   %20s
        _newline
          _column(1)     byte     yesno1 %1f
          _column(2)     byte     yesno2 %1f
          _column(3)     byte     yesno3 %1f
          _column(4)     byte     yesno4 %1f
          _column(5)     byte     yesno5 %1f
          _column(6)     byte     yesno6 %1f
          _column(7)     byte     yesno7 %1f
        }

After each _newline, column numbering starts over, so _column(1) refers to the first column of the new line. Here is the result:

 . clear

 . quietly infile using test3.dct

 . list

      +-----------------------------------------------------------------------+
   1. |           name |                    addr |              city | yesno1 |
      | Jonathan Swift | 12345 South Mockingbird | Detroit, Michigan |      1 |
      |-----------------------------------------------------------------------|
      |  yesno2   |  yesno3   |  yesno4   |  yesno5   |  yesno6   |  yesno7   |
      |       0   |       1   |       0   |       1   |       1   |       1   |
      +-----------------------------------------------------------------------+

      +-----------------------------------------------------------------------+
   2. |           name |                    addr |              city | yesno1 |
      |   e e cummings |                4123 Elm | Buffalo, New York |      1 |
      |-----------------------------------------------------------------------|
      |  yesno2   |  yesno3   |  yesno4   |  yesno5   |  yesno6   |  yesno7   |
      |       1   |       0   |       1   |       2   |       1   |       0   |
      +-----------------------------------------------------------------------+
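An alternative way to describe multiline records is to declare the number of lines per observation with _lines(#) and then name each line with _line(#), rather than advancing with _newline. The following is a sketch of the same dictionary written that way; it should be checked against the infile documentation for your version of Stata before you rely on it.

```stata
dictionary using test3.raw {
  * each observation occupies four lines of the text file
  _lines(4)
  _line(1)
    _column(1)     str15    name   %15s
  _line(2)
    _column(1)     str30    addr   %30s
  _line(3)
    _column(1)     str20    city   %20s
  _line(4)
    _column(1)     byte     yesno1 %1f
    _column(2)     byte     yesno2 %1f
    _column(3)     byte     yesno3 %1f
    _column(4)     byte     yesno4 %1f
    _column(5)     byte     yesno5 %1f
    _column(6)     byte     yesno6 %1f
    _column(7)     byte     yesno7 %1f
}
```

Naming the lines explicitly can make the dictionary easier to compare with documentation that describes the file line by line.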

More notes

Another piece of advice for reading large text files is to use the in qualifier to limit infile to reading just one observation. This limit lets you test your dictionary and see whether it is working properly.

 . infile using test1a in 1

 dictionary using test1.raw {
         _column(1)      byte    b1      %1f
         _column(2)      byte    b2      %1f
         _column(3)      byte    b3      %1f
         _column(4)      byte    b4      %1f
 }

 (1 observations read)

 . list

      +-------------------+
      | b1   b2   b3   b4 |
      |-------------------|
   1. |  1    1    0    1 |
      +-------------------+

Because that looks correct, you might continue by reading in the entire dataset, or you might read in the first five observations to test the dictionary further. This brings up an important point: if you get documentation with the data that you are trying to read into Stata, you should always use the assert command to check that the data follow the description set out in the documentation. For instance, in Example 3, the documentation said that there were seven yes/no questions coded as 1=yes and 0=no. After reading in those data, you should check that

 . assert yesno1==0 | yesno1==1

 . assert yesno2==0 | yesno2==1

 . assert yesno3==0 | yesno3==1

 . assert yesno4==0 | yesno4==1

 . assert yesno5==0 | yesno5==1
 1 contradiction out of 2
 assertion is false
 r(9);

 . assert yesno6==0 | yesno6==1

 . assert yesno7==0 | yesno7==1

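As you can see, one of the assertions failed: yesno5 is neither 0 nor 1 in one of the two observations. That might mean that our dictionary is wrong. On the other hand, it could mean that the documentation that came with the data is wrong. Here the last line of test3.raw contains a 2 in column 5, so the file itself departs from the documentation. Regardless, we should note that this discrepancy exists and question the data provider about it.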
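The seven checks can also be written as a loop. The sketch below assumes the Example 3 data are in memory; it uses capture noisily so that a failed assertion is reported without stopping the loop, and then lists the observations that violate the documented coding for yesno5.

```stata
* check every yes/no variable against the documented coding (0 or 1)
foreach v of varlist yesno1-yesno7 {
    display "checking `v'"
    capture noisily assert inlist(`v', 0, 1)
}

* show the offending observations for the variable that failed
list name yesno5 if !inlist(yesno5, 0, 1)
```

Listing the offending observations gives you something concrete to send to the data provider along with your question.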