infile dictionary options

Title   Infile dictionary options
Author James Hardin, StataCorp
Date January 1996

If you are reading a dataset with a dictionary, then Stata is reading that data in record mode. This means that Stata treats each row of the raw data file as a record and splits it into variables. With the dictionary, you have complete control over how that information is assigned to your variables in Stata, so it pays to learn how to use it well. In a dictionary, there is one line for each variable that you will be reading in. Each line may contain the following directives:
  1. An optional _column(#) directive stating where to begin reading. This tells Stata to move to the specified column before reading the data for this variable. By default, Stata will just move to the next column.
  2. An optional _skip(#) directive stating how far to skip from the end of the last field that was processed. This tells Stata how many columns to skip before reading the data for this variable. By default, Stata will just skip over one more column from the previous variable.
  3. An optional data type for how to store the variable. Do not get this directive confused with the read format specifier. This directive only affects how the data is stored and not how it is read. Just because you specify that a variable is type str5 does not mean that Stata will read in 5 columns of data for this variable. In order to control how many columns are read for a particular variable, you must specify a read format. If you do not specify a type for the variable, it will default to float.
  4. A required variable name. You must specify a variable name for each of the fields that you will read from the raw data records. There is no shortcut for this.
  5. An optional label name to specify a value label for the variable. You can use this if you want to associate a value label name that you will apply to a numeric variable. (Value labels allow numeric variables to contain "strings", the strings being numerically encoded.) This is rarely used (the data is typically read into string variables) and may only be useful when you will be applying a great number of value labels to your variables, or when you are defining the value labels and the infile steps in a do-file.
  6. An optional read format for reading this variable's values. Many people assume that once the type is specified, the read format does not also have to be specified. On the contrary, the read format is often even more important, because it is what tells Stata how many columns to read for a particular variable.
  7. An optional variable label to apply to this variable. This is the descriptive label that the describe command prints to the right of the variable. You do not have to specify a variable label, though for large datasets it can be helpful.
Sometimes you need to specify only a few of these directives; at other times you may need to specify many of them.
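
As a rough illustration of how these directives fit together, here is a sketch of a dictionary for a hypothetical fixed-column file, people.raw. The file name, its layout, and all variable and label names are assumptions made up for this example. In Stata's dictionary syntax the positioning directives are written with a leading underscore, as _column(#) and _skip(#), and a value label name follows the variable name after a colon. Suppose each record holds an ID in columns 1-4, a name in columns 7-16, a sex code in column 17, and a weight in columns 19-22:

```stata
dictionary using people.raw {
* id: move to column 1, read 4 columns, store as int
    _column(1)   int    id          %4f   "Person ID"
* name: skip 2 columns past the end of id, read 10 columns into a str10
    _skip(2)     str10  name        %10s  "Last name"
* sex: starts where name ended; sexlbl is the value label name
                 byte   sex:sexlbl  %1f   "Sex"
* weight: no storage type given, so it is stored as float
    _column(19)         weight      %4f   "Weight (kg)"
}
```

If this were saved as people.dct, it could be read with infile using people.dct. The dictionary only attaches the value label name sexlbl; the label itself would still have to be defined, for example with label define sexlbl 1 "male" 2 "female". Note that the field widths come from the read formats (%4f, %10s, %1f), not from the storage types.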

The above tools allow you to control, for each variable:

  1. how it is read
  2. where it is read from
  3. how it is stored
  4. what the variable is called
  5. what the variable is labeled
  6. how the values are labeled
This is enough for almost all datasets that you encounter. However, some datasets have additional complexities in how they are organized in the file. To address them, you may specify these other directives to further control the overall behavior of Stata as it processes the data file:
  1. _lrecl(#) allows you to specify the logical record length for the file. In most files, there are line breaks from one record to the next. In other files, there are no line breaks, but each record is a fixed number of characters long. You use this directive to specify that length.
  2. _newline(#) allows you to specify that the next field described begins # lines down in the file. Some datasets are organized in such a way that each record extends across multiple lines in the file.
  3. Comments are allowed in the dictionary file and give you the opportunity to add notes to the dictionaries that you create for reading in your raw data files. This is the most overlooked optional feature of the dictionary file. You should use it, however, because it lets you return to old dictionaries and remind yourself of how you solved problematic reads.
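
The following is a minimal sketch of a dictionary that combines comments with a line-advance directive, assuming a hypothetical file survey.raw in which every observation occupies two lines. The file name, layout, and variable names are invented for illustration. In Stata's dictionary syntax, comment lines begin with an asterisk, and these directives are written with a leading underscore (_newline(#), _lrecl(#)):

```stata
dictionary using survey.raw {
* survey.raw: each observation spans two lines
* line 1: household ID in columns 1-5, region code in column 7
    _column(1)              long   hhid    %5f  "Household ID"
    _column(7)              byte   region  %1f  "Region code"
* line 2: income in columns 1-8, so move down one line before reading it
    _newline(1) _column(1)  float  income  %8f  "Household income"
}
```

For a file with no line breaks at all, a line such as _lrecl(12) placed inside the braces would ask Stata to act as if a record ends every 12 characters; the value 12 is only an example and would have to match the actual record length of the file.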