How do I use infile to read in fixed-format data?
Title: Reading fixed-format data with infile
Author: James Hardin, StataCorp
Date: October 1996; minor revisions August 2005
You need to use a dictionary to read fixed-format data. Creating a
dictionary can be confusing if you get caught up in all the details, but
the advice below handles most of the text files that you will encounter.
Some special cases are addressed in the later examples.
To avoid almost all the infile problems that arise when reading
fixed-format files, do the following:
- Enter one line in the dictionary for each variable that you will read.
- For each variable that you read in, stipulate the following four settings:
  - starting column, for example, _column(1)
  - storage type, for example, byte or str16
  - variable name, for example, b1
  - read format, for example, %1f or %16s
Note: The most frequent cause of confusion for users is that the
“storage type” has nothing to do with how a field is read from
the text file. It only affects how that field is stored in Stata after it
is read from the text file. To control how the field is read from the text
file, use the “read format”.
For example, specifying that a variable should be stored as type
str25 does not, by itself, mean that Stata will read 25 columns of
information when it processes the text file. A read format such as %25s is
what tells Stata to read 25 columns.
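To make the distinction concrete, here is a minimal sketch of a dictionary in which the storage type and the read format deliberately differ. The file name mydata.raw and the variables name and age are hypothetical and are not part of the examples that follow.

```stata
dictionary using mydata.raw {
    * hypothetical layout: a name in columns 1-10, an age in columns 11-13
    * str25 sets only how name is stored; %10s is what reads 10 columns
    _column(1)   str25  name  %10s
    * int sets only how age is stored; %3f is what reads 3 columns
    _column(11)  int    age   %3f
}
```

Under these assumptions, Stata would read 10 columns into name even though the variable can hold up to 25 characters.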
Example 1
To read the data from the test1.raw file below,
1101
0111
1100
you can use the following dictionary in the file test1a.dct:
dictionary using test1.raw {
_column(1) byte b1 %1f
_column(2) byte b2 %1f
_column(3) byte b3 %1f
_column(4) byte b4 %1f
}
You could also use the dictionary in file test1b.dct:
dictionary using test1.raw {
_column(2) byte b2 %1f
_column(1) byte b1 %1f
_column(4) byte b4 %1f
_column(3) byte b3 %1f
}
The only difference between these two approaches is the order in which the
variables are stored in Stata.
. clear
. quietly infile using test1a
. list
+-------------------+
| b1 b2 b3 b4 |
|-------------------|
1. | 1 1 0 1 |
2. | 0 1 1 1 |
3. | 1 1 0 0 |
+-------------------+
. clear
. quietly infile using test1b
. list
+-------------------+
| b2 b1 b4 b3 |
|-------------------|
1. | 1 1 1 0 |
2. | 1 0 1 1 |
3. | 1 1 0 0 |
+-------------------+
This example also shows that you can access the columns of the text file in
any order.
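Because each entry names its own starting column, a dictionary does not have to mention every column of the file. As a sketch, a dictionary saved under the hypothetical name test1c.dct could read only the first and fourth columns of test1.raw:

```stata
dictionary using test1.raw {
    * columns 2 and 3 are never referenced, so no variables are created for them
    _column(1)  byte  b1  %1f
    _column(4)  byte  b4  %1f
}
```

Typing infile using test1c would then create just the two variables b1 and b4.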
Example 2
Now let us say that you want to read in the data from the file
test2.raw:
C1245A101George Costanza
B1223B011Cosmo Kramer
The documentation from the person who supplied the data states
that
- A unique code called id is in the first 5 columns.
- A 4-digit call number is part of the id in columns 2–5.
- A letter denoting a city code is in column 6.
- A 3-digit neighborhood code is in columns 7–9.
- A name appears in columns 10–25.
So, we can prepare a dictionary like this:
dictionary using test2.raw {
_column(1) str5 code %5s
_column(2) int call %4f
_column(6) str1 city %1s
_column(7) int neigh %3f
_column(10) str16 name %16s
}
This example shows that you can reread columns, placing their contents into
different variables. Although the data look much more complicated in this
example, always giving the same four settings keeps the dictionary easy to
read and easy to check against the documentation that came with the data.
. clear
. quietly infile using test2
. list
+-----------------------------------------------+
| code call city neigh name |
|-----------------------------------------------|
1. | C1245 1245 A 101 George Costanza |
2. | B1223 1223 B 11 Cosmo Kramer |
+-----------------------------------------------+
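In the listing above, the neighborhood code 011 in the second record is stored as the number 11 because neigh is numeric. If the leading zero matters, one option is to reread the same columns into a string variable. The sketch below shows only the relevant entries; neigh_s is a hypothetical variable name that does not appear in the original dictionary.

```stata
dictionary using test2.raw {
    * numeric version, as before: 011 is stored as 11
    _column(7)  int   neigh    %3f
    * hypothetical string version of the same columns: 011 is kept as "011"
    _column(7)  str3  neigh_s  %3s
}
```

This relies on the same rereading of columns that the dictionary already uses for code and call.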
Example 3
Here we introduce records that span more than one line in the text file.
The only additional requirement when we write the dictionary is that we
must specify where each record continues on a new line.
Consider the data in test3.raw:
Jonathan Swift
12345 South Mockingbird
Detroit, Michigan
1010111
e e cummings
4123 Elm
Buffalo, New York
1101210
The documentation that accompanied this file tells us that
- The name appears on the first line and is at most 15 characters long.
- The address appears on the second line and is at most 30
characters long.
- The city appears on the third line and is at most 20 characters
long.
- Seven yes/no questions appear on the fourth line, where yes=1 and no=0.
For these data, we can prepare the dictionary:
dictionary using test3.raw {
_column(1) str15 name %15s
_newline
_column(1) str30 addr %30s
_newline
_column(1) str20 city %20s
_newline
_column(1) byte yesno1 %1f
_column(2) byte yesno2 %1f
_column(3) byte yesno3 %1f
_column(4) byte yesno4 %1f
_column(5) byte yesno5 %1f
_column(6) byte yesno6 %1f
_column(7) byte yesno7 %1f
}
After a _newline, column numbering starts over, so _column(1) refers to
the first column of the new line. Here is the result:
. clear
. quietly infile using test3.dct
. list
+-----------------------------------------------------------------------+
1. | name | addr | city | yesno1 |
| Jonathan Swift | 12345 South Mockingbird | Detroit, Michigan | 1 |
|-----------------------------------------------------------------------|
| yesno2 | yesno3 | yesno4 | yesno5 | yesno6 | yesno7 |
| 0 | 1 | 0 | 1 | 1 | 1 |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
2. | name | addr | city | yesno1 |
| e e cummings | 4123 Elm | Buffalo, New York | 1 |
|-----------------------------------------------------------------------|
| yesno2 | yesno3 | yesno4 | yesno5 | yesno6 | yesno7 |
| 1 | 0 | 1 | 2 | 1 | 0 |
+-----------------------------------------------------------------------+
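An alternative way to describe multiline records is to declare the number of lines per observation with the _lines(#) directive and then move to a given line with _line(#). The following is a sketch of how the same dictionary might look in that style. It is meant as an equivalent layout, not as a replacement for the approach above, and it has not been run against test3.raw here.

```stata
dictionary using test3.raw {
    * every observation occupies 4 lines of the file
    _lines(4)
    _line(1)
    _column(1)  str15  name    %15s
    _line(2)
    _column(1)  str30  addr    %30s
    _line(3)
    _column(1)  str20  city    %20s
    _line(4)
    _column(1)  byte   yesno1  %1f
    _column(2)  byte   yesno2  %1f
    _column(3)  byte   yesno3  %1f
    _column(4)  byte   yesno4  %1f
    _column(5)  byte   yesno5  %1f
    _column(6)  byte   yesno6  %1f
    _column(7)  byte   yesno7  %1f
}
```

Naming the line explicitly can make it easier to match each entry to the documentation when records are long.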
More notes
Another piece of advice for reading large text files is to add an in
qualifier, such as in 1, so that infile reads just one observation. This
limit lets you test your dictionary and see whether it is working
properly.
. infile using test1a in 1
dictionary using test1.raw {
_column(1) byte b1 %1f
_column(2) byte b2 %1f
_column(3) byte b3 %1f
_column(4) byte b4 %1f
}
(1 observations read)
. list
+-------------------+
| b1 b2 b3 b4 |
|-------------------|
1. | 1 1 0 1 |
+-------------------+
Since that looks OK, you might continue by reading in the entire dataset, or
you might read in the first five observations to further test the
dictionary. This brings up an important point: if you get documentation with
the data that you are trying to read into Stata, you should always use the
assert command to check that the data follow the description set out in the
documentation. For instance, in Example 3, the documentation said that
there were seven yes/no questions coded as 1=yes and 0=no. After
reading in those data, you should check that
. assert yesno1==0 | yesno1==1
. assert yesno2==0 | yesno2==1
. assert yesno3==0 | yesno3==1
. assert yesno4==0 | yesno4==1
. assert yesno5==0 | yesno5==1
1 contradiction out of 2
assertion is false
r(9);
. assert yesno6==0 | yesno6==1
. assert yesno7==0 | yesno7==1
As you can see, one of the assertions failed: yesno5 is neither 0 nor 1 in
one of the two observations. That might mean that our dictionary is wrong.
On the other hand, it could mean that the documentation that came with the
data is wrong. Regardless, we should note that this discrepancy exists and
question the data provider about it.
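When there are many variables to check, the same assertions can be written as a loop. The following is a minimal sketch for a do-file, assuming the test3 data are already in memory. Because a do-file stops at the first failed assert, the sketch prefixes the command with capture noisily so that every variable is checked.

```stata
* check each of the seven yes/no variables against the documented coding
foreach v of varlist yesno1-yesno7 {
    * capture noisily displays any failure but lets the loop continue
    capture noisily assert inlist(`v', 0, 1)
}

* display the observations that violate the documented coding for yesno5
list name yesno5 if !inlist(yesno5, 0, 1)
```

Listing the offending observations gives you something concrete to send to the data provider.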