How do I get information from Excel into Stata?
|
Title
|
|
Converting other format files into Stata dataset files
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
March 2003; minor revision August 2005, August 2006
|
One solution to converting Excel datasets to Stata format is to obtain a
data-translation package. You could obtain
Stat/Transfer from StataCorp.
You can also read the dataset without resorting to such packages. That is
the subject of this FAQ, whose content is
-
1. A rule to remember
2. How to get information from Excel into Stata
3. Other methods for transferring information
-
3.1 insheet command
3.2 XML Spreadsheet and xmluse
3.3 ODBC and odbc load
4. Copying a Stata graph into Excel or any other package
5. Common problems
-
5.1 Nonnumeric characters
5.2 Spaces
5.3 Cell formats
5.4 Variable names
5.5 Missing rows and columns
5.6 Leading zeros
5.7 Filename and folder
1. A rule to remember
Stata expects one matrix or table of data from one sheet,
with at most one line of text at the start defining the contents of
the columns.
2. How to get information from Excel into Stata
Using your Windows or Macintosh computer,
-
Start Excel.
-
Enter data in rows and columns or read in a previously saved file.
-
Highlight the data of interest, and then select Edit and click
Copy.
-
Start Stata and open the Data Editor (type edit at the Stata dot
prompt).
-
Paste data into editor by selecting Edit and clicking Paste.
3. Other methods for transferring information
3.1 insheet command
The following section is from material by James Hardin, University of South
Carolina, and Ted Anagnoson, California State Los Angeles.
-
Launch Excel and read in your Excel file.
-
Save as a text file (tab delimited or comma delimited) by
selecting File and clicking Save As.
If the original filename is
filename.xls, then save the file under the name
filename.txt or filename.csv.
(Use the Save as type list—specifying an extension such as
.txt is not sufficient to produce a text file.)
-
Quit Excel if you wish.
-
Launch Stata if it is not already running. (If Stata is already running,
then either
save or
clear
your current data.)
-
In Stata, type insheet using filename.ext,
where filename.ext is the name of the file that you
just saved in Excel. Give the complete filename, including the
extension.
-
In Stata, type
compress.
-
Save the data as a Stata dataset using the
save
command.
3.2 XML Spreadsheet and xmluse
The following is provided by Kevin Turner, StataCorp.
- Launch Excel and read in your Excel file.
- Save as XML Spreadsheet file by selecting File and
clicking Save As. You will need Excel 2003 or
greater to be provided with the option to save as an XML
Spreadsheet. If the original filename is
filename.xls, then save the file under the name
filename.xml. (Use the Save as
type list—specifying an extension such as
.xml is not sufficient to produce an XML file.)
- Quit Excel if you wish.
- Launch Stata if it is not already running. (If Stata is already
running, then either save or clear your current data.)
- In Stata, type xmluse filename.ext,
doctype(excel) allstring, where
filename.ext is the name of the file that
you just saved in Excel. Give the complete
filename, including the extension. (Option allstring
specifies that you wish to import all the data as strings.
Importing data types as they are, without the option, might be
desired, but carries more risk of errors due to
inconsistent data types in a column. More general problems
with importing Excel files are discussed in
section 5.)
- Save the data as a Stata dataset using the
save
command.
3.3 ODBC and odbc load
The following section is provided by Kevin Turner, StataCorp.
- Launch Stata.
- List the ODBC data sources that have been defined by
Windows using the odbc list command.
- Click DSN (data source name) listing provided by odbc
list to query that DSN. Typically, Microsoft provides a
default entry called “Excel Files” that you can
use to choose any Excel (*.xls) file to load via ODBC. You
must select an Excel file every time you issue an
odbc command using this DSN. You can also
define your own DSN that always points to a specific Excel
file. On Windows XP, you would define this special DSN via the
Control Panel called “Administrative Tools”, and
then select “Data Sources (ODBC)”. More
documentation is available from Microsoft concerning how to define
your own Data Sources.
- Click the sheet/table corresponding to your data within the
Excel file to describe the contents. You may need to
issue the odbc query command with the
dialog(complete) option if you selected an arbitrary
Excel file in the previous list.
- If you are satisfied with the previous description of the
sheet/table, you can click to load the described table.
- If all goes well, your data will load into Stata. There are,
however, a few general reasons why loading Excel via ODBC
may be problematic, and those are covered in
section 5.
4. Copying a Stata graph into Excel or any other package
Once you have a suitable graph in Stata's Graph window,
-
Select Edit and click Copy Graph.
-
Open or switch to Excel and move to where you want to paste the graph.
-
Select Edit and click Paste.
These steps should also work in other packages that accept input in this
manner.
5. Common problems
The following section is from material by Ted Anagnoson, California State
Los Angeles; Dan Chandler, Trinidad, CA; Ronan Conroy, Royal College of
Surgeons, Dublin; David Moore, Hartman Group; Paul Wicks, South Manchester
University Hospitals Trust; Eric Wruck, Positive Economics; and Colin
Fischbacher, University of Edinburgh.
5.1 Nonnumeric characters
One cell containing a nonnumeric character, such as a letter, within a
column of data is enough for Stata to make that variable a string variable.
It is often easiest to fix this in Excel. Within Stata,
suppose the problematic string variable is named foo.
Here are three alternative ways to identify the rogue observations:
. tab foo if real(foo) == .
. edit foo if real(foo) == .
. list foo if real(foo) == .
If appropriate, they can be replaced by missing, and then the variable as a
whole can be converted to numeric by typing:
. replace foo = "" if real(foo) == .
. gen newfoo = real(foo)
. drop foo
. rename newfoo foo
You could also use
destring:
. destring foo, replace
As implemented in Stata 7, destring includes an option for stripping
commas, dollar signs, percent signs, and other nonnumeric characters. It
also allows automatic conversion of percentage data.
5.2 Spaces
What appear to be purely numeric data in Excel are often treated by
Stata as string variables because they include spaces. People may
inadvertently enter space characters in cells that are otherwise
empty. Although Excel strips leading and trailing spaces from numeric
entries, it does not trim spaces from character entries. One or more
space characters by themselves constitute a valid character entry and
are stored as such. Stata dutifully reads the entire column as a
string variable.
Excel has a search and replace capability that enables you to delete these
stray spaces, or you can use a text-processing program or a text editor on
the text file. You can also use the solution in
nonnumeric characters.
5.3 Cell formats
Much of the formatting in Excel interferes with Stata's ability to interpret
the data reasonably. Just before saving the data as a text file, make sure
that all formatting is turned off, at least temporarily. You can do this by
highlighting the entire spreadsheet, selecting Format, then
selecting Cells, and clicking General.
However, no solution solves all problems. Here is a cautionary tale. A text
file included address data. One column included house numbers, and a few
were in the form 4/2. Excel decided these few were dates and converted them
to 4th February. Setting all cells to a General format does not help
because it converts these unwanted dates to 5 digit Excel date codes. One
solution is to apply a Text format to the offending column when
offered the option during Excel's text import process. But even this works
only if you have manageably few columns to look through and are aware of the
possibility of the problem.
5.4 Variable names
Stata limits variable names to 32 characters and does not allow
any characters that it uses as operators or delimiters. Also,
variable names should start with a letter. People who are Excel
users first and Stata users second are often creative with the names
they apply to columns. Stata converts illegal column (field) names to labels
and makes a best guess at a sensible variable name. Stata's best guess,
however, may not be as good as the name a user would choose knowing Stata's
naming restrictions.
For example, Stata will make variable names using the first 32 characters
of the variable name and use the rest for a label. If the first 32
characters are not unique, subsequent occurrences will be called
var1, var2, etc., or v1, v2, etc. (If you paste
the data, the variable stub is var; if you use insheet, the
stub is v, so be careful writing do-files.)
5.5 Missing rows and columns
Stata completely ignores empty rows in a spreadsheet but not
completely empty columns. A completely empty column gets read
in as a variable with missing values for every observation. Of
course, no harm is done in either case, but spreadsheet users who wish
to add blank columns and/or rows to enhance legibility may wish to
note this difference.
It is best if the first row of data is complete with no missing data.
If necessary, add a dummy row with every value present, and then once in
Stata type
. drop in 1
A command dropmiss by Nicholas J. Cox, which allows easy dropping of
variables or observations that are all missing, was published in STB-60.
Type search dropmiss for information on this command.
5.6 Leading zeros
With integer-like codes, such as ICD-9 codes or U.S. Social Security
numbers, that do not contain a dash, leading zeros will get dropped when
pasted into Stata from Excel. One solution is to flag
the variable as a string in the first line: add a nonnumeric character in Excel on that
line, and then remove it in Stata.
The missing leading zeros can also be replaced in a conversion to string
with one Stata command line; for example,
. gen svar = string(nvar, "%012.0f")
The second argument on the right-hand side of this command is a format
specifying leading zeros on conversion of nvar to its string
equivalent. For more details on formats, see
format.
5.7 Filename and folder
Confirm the filename and location of the file you are trying to read.
Use Explorer or its equivalent to check.
For example, you may have inadvertently produced a file named
filename.txt.txt, or more generally, a name that ends with two
extensions, which may or may not be the same extension. This naming is
possible if you have an option checked in Windows Explorer under
View, Folder Options to hide file extensions for known or
registered file types. Manually rename the file, or use the correct
filename in your Stata command. You may also wish to uncheck this option to
avoid similar mistakes in the future.
|
FAQs
What's new?
Statistics
Data management
Graphics
Programming Stata
Mata
Resources
Internet capabilities
Stata for Windows
Stata for Unix
Stata for Mac
Technical support
|