Search
   >> Home >> Resources & support >> FAQs >> Converting other format files into Stata dataset files

How do I get information from Excel into Stata?

Title   Converting other format files into Stata dataset files
Author Nicholas J. Cox, Durham University, UK
Date March 2003; updated April 2013; minor revision February 2014

One solution to converting Excel datasets to Stata format is to obtain a data-translation package. You could obtain Stat/Transfer from StataCorp.

You can also read the dataset without resorting to such packages. That is the subject of this FAQ, whose content is

1. A rule to remember
2. How to get information from Excel into Stata
3. Other methods for transferring information
3.1 Copy and paste
3.2 import delimited command
3.3 ODBC and odbc load
4. Copying a Stata graph into Excel or any other package
5. Common problems
5.1 Nonnumeric characters
5.2 Spaces
5.3 Cell formats
5.4 Variable names
5.5 Missing rows and columns
5.6 Leading zeros
5.7 Filename and folder

1. A rule to remember

Stata expects one matrix or table of data from one sheet, with at most one line of text at the start defining the contents of the columns.

2. How to get information from Excel into Stata

Stata can directly import data from Excel (both .xls and .xlsx) files.

Select File > Import > Excel Spreadsheet from Stata's menus.

3. Other methods for transferring information

3.1 Copy and paste

Using your Windows or Mac computer,

  1. Start Excel.
  2. Enter data in rows and columns or read in a previously saved file.
  3. Highlight the data of interest, and then select Edit and click Copy.
  4. Start Stata and open the Data Editor (type edit at the Stata dot prompt).
  5. Paste data into editor by selecting Edit and clicking Paste.

3.2 import delimited command

The following section is based on material originally written by James Hardin, University of South Carolina, and Ted Anagnoson, California State Los Angeles.

  1. Launch Excel and read in your Excel file.
  2. Save as a text file (tab delimited or comma delimited) by selecting File and clicking Save As. If the original filename is filename.xls, then save the file under the name filename.txt or filename.csv. (Use the Save as type list—specifying an extension such as .txt is not sufficient to produce a text file.)
  3. Quit Excel if you wish.
  4. Launch Stata if it is not already running. (If Stata is already running, then either save or clear your current data.)
  5. In Stata, type import delimited using filename.ext, where filename.ext is the name of the file that you just saved in Excel. Give the complete filename, including the extension.
  6. In Stata, type compress.
  7. Save the data as a Stata dataset using the save command.

3.3 ODBC and odbc load

The following section is provided by Kevin Turner, StataCorp.

  1. Launch Stata.
  2. List the ODBC data sources that have been defined by Windows using the odbc list command.
  3. Click DSN (data source name) listing provided by odbc list to query that DSN. Typically, Microsoft provides a default entry called “Excel Files” that you can use to choose any Excel (*.xls) file to load via ODBC. You must select an Excel file every time you issue an odbc command using this DSN. You can also define your own DSN that always points to a specific Excel file. On Windows XP, you would define this special DSN via the Control Panel called “Administrative Tools”, and then select “Data Sources (ODBC)”. More documentation is available from Microsoft concerning how to define your own Data Sources.
  4. Click the sheet/table corresponding to your data within the Excel file to describe the contents. You may need to issue the odbc query command with the dialog(complete) option if you selected an arbitrary Excel file in the previous list.
  5. If you are satisfied with the previous description of the sheet/table, you can click to load the described table.
  6. If all goes well, your data will load into Stata. There are, however, a few general reasons why loading Excel via ODBC may be problematic, and those are covered in section 5.

4. Copying a Stata graph into Excel or any other package

Once you have a suitable graph in Stata's Graph window,

  1. Select Edit and click Copy Graph.
  2. Open or switch to Excel and move to where you want to paste the graph.
  3. Select Edit and click Paste.

These steps should also work in other packages that accept input in this manner.

5. Common problems

The following section is from material by Ted Anagnoson, California State Los Angeles; Dan Chandler, Trinidad, CA; Ronan Conroy, Royal College of Surgeons, Dublin; David Moore, Hartman Group; Paul Wicks, South Manchester University Hospitals Trust; Eric Wruck, Positive Economics; and Colin Fischbacher, University of Edinburgh.

The problems mentioned in it are primarily with respect to text-based methods of importing data from Excel to Stata, such as copying and pasting and import delimited. import excel handles most of these issues.

5.1 Nonnumeric characters

One cell containing a nonnumeric character, such as a letter, within a column of data is enough for Stata to make that variable a string variable. It is often easiest to fix this in Excel. Within Stata, suppose the problematic string variable is named foo. Here are three alternative ways to identify the rogue observations:

        . tab foo if real(foo) == .
        . edit foo if real(foo) == .
        . list foo if real(foo) == .

If appropriate, they can be replaced by missing, and then the variable as a whole can be converted to numeric by typing:

        . replace foo = "" if real(foo) == .
        . gen newfoo = real(foo)
        . drop foo 
        . rename newfoo foo

You could also use destring:

        . destring foo, replace

destring includes an option for stripping commas, dollar signs, percent signs, and other nonnumeric characters. It also allows automatic conversion of percentage data.

5.2 Spaces

What appear to be purely numeric data in Excel are often treated by Stata as string variables because they include spaces. People may inadvertently enter space characters in cells that are otherwise empty. Although Excel strips leading and trailing spaces from numeric entries, it does not trim spaces from character entries. One or more space characters by themselves constitute a valid character entry and are stored as such. Stata dutifully reads the entire column as a string variable.

Excel has a search and replace capability that enables you to delete these stray spaces, or you can use a text-processing program or a text editor on the text file. You can also use the solution in nonnumeric characters.

5.3 Cell formats

Much of the formatting in Excel interferes with Stata's ability to interpret the data reasonably. Just before saving the data as a text file, make sure that all formatting is turned off, at least temporarily. You can do this by highlighting the entire spreadsheet, selecting Format, then selecting Cells, and clicking General.

However, no solution solves all problems. Here is a cautionary tale. A text file included address data. One column included house numbers, and a few were in the form 4/2. Excel decided these few were dates and converted them to 4th February. Setting all cells to a General format does not help because it converts these unwanted dates to 5 digit Excel date codes. One solution is to apply a Text format to the offending column when offered the option during Excel's text import process. But even this works only if you have manageably few columns to look through and are aware of the possibility of the problem.

5.4 Variable names

Stata limits variable names to 32 characters and does not allow any characters that it uses as operators or delimiters. Also, variable names should start with a letter. People who are Excel users first and Stata users second are often creative with the names they apply to columns. Stata converts illegal column (field) names to labels and makes a best guess at a sensible variable name. Stata's best guess, however, may not be as good as the name a user would choose knowing Stata's naming restrictions.

For example, Stata will make variable names using the first 32 characters of the variable name and use the rest for a label. If the first 32 characters are not unique, subsequent occurrences will be called var1, var2, etc., or v1, v2, etc. (If you paste the data, the variable stub is var; if you use insheet, the stub is v, so be careful writing do-files.)

5.5 Missing rows and columns

Stata completely ignores empty rows in a spreadsheet but not completely empty columns. A completely empty column gets read in as a variable with missing values for every observation. Of course, no harm is done in either case, but spreadsheet users who wish to add blank columns and/or rows to enhance legibility may wish to note this difference.

It is best if the first row of data is complete with no missing data. If necessary, add a dummy row with every value present, and then once in Stata type

        . drop in 1

A command dropmiss by Nicholas J. Cox, which allows easy dropping of variables or observations that are all missing, was published in STB-60 and revised in Stata Journal 8(4). Type search dropmiss for information on this command.

5.6 Leading zeros

With integer-like codes, such as ICD-9 codes or U.S. Social Security numbers, that do not contain a dash, leading zeros will get dropped when pasted into Stata from Excel. One solution is to flag the variable as a string in the first line: add a nonnumeric character in Excel on that line, and then remove it in Stata.

The missing leading zeros can also be replaced in a conversion to string with one Stata command line; for example,

        . gen svar = string(nvar, "%012.0f")

The second argument on the right-hand side of this command is a format specifying leading zeros on conversion of nvar to its string equivalent. For more details on formats, see format.

5.7 Filename and folder

Confirm the filename and location of the file you are trying to read. Use Explorer or its equivalent to check.

For example, you may have inadvertently produced a file named filename.txt.txt, or more generally, a name that ends with two extensions, which may or may not be the same extension. This naming is possible if you have an option checked in Windows Explorer under View, Folder Options to hide file extensions for known or registered file types. Manually rename the file, or use the correct filename in your Stata command. You may also wish to uncheck this option to avoid similar mistakes in the future.

The Stata Blog: Not Elsewhere Classified Find us on Facebook Follow us on Twitter LinkedIn Google+ Watch us on YouTube