|Title||Converting other format files into Stata dataset files|
|Author||Nicholas J. Cox, Durham University, UK|
One solution to converting Excel datasets to Stata format is to obtain a data-translation package.
You can also read the dataset without resorting to such packages. That is the subject of this FAQ, whose content is
Stata expects one matrix or table of data from one sheet, with at most one line of text at the start defining the contents of the columns.
Stata can directly import data from Excel (both .xls and .xlsx) files.
Select File > Import > Excel Spreadsheet from Stata's menus.
Using your Windows or Mac computer,
The following section is based on material originally written by James Hardin, University of South Carolina, and Ted Anagnoson, California State Los Angeles.
The following section is provided by Kevin Turner, StataCorp.
Once you have a suitable graph in Stata's Graph window,
These steps should also work in other packages that accept input in this manner.
The following section is from material by Ted Anagnoson, California State Los Angeles; Dan Chandler, Trinidad, CA; Ronan Conroy, Royal College of Surgeons, Dublin; David Moore, Hartman Group; Paul Wicks, South Manchester University Hospitals Trust; Eric Wruck, Positive Economics; and Colin Fischbacher, University of Edinburgh.
The problems mentioned in it are primarily with respect to text-based methods of importing data from Excel to Stata, such as copying and pasting and import delimited. import excel handles most of these issues.
One cell containing a nonnumeric character, such as a letter, within a column of data is enough for Stata to make that variable a string variable. It is often easiest to fix this in Excel. Within Stata, suppose the problematic string variable is named foo. Here are three alternative ways to identify the rogue observations:
. tab foo if real(foo) == . . edit foo if real(foo) == . . list foo if real(foo) == .
If appropriate, they can be replaced by missing, and then the variable as a whole can be converted to numeric by typing:
. replace foo = "" if real(foo) == . . gen newfoo = real(foo) . drop foo . rename newfoo foo
You could also use destring:
. destring foo, replace
destring includes an option for stripping commas, dollar signs, percent signs, and other nonnumeric characters. It also allows automatic conversion of percentage data.
What appear to be purely numeric data in Excel are often treated by Stata as string variables because they include spaces. People may inadvertently enter space characters in cells that are otherwise empty. Although Excel strips leading and trailing spaces from numeric entries, it does not trim spaces from character entries. One or more space characters by themselves constitute a valid character entry and are stored as such. Stata dutifully reads the entire column as a string variable.
Excel has a search and replace capability that enables you to delete these stray spaces, or you can use a text-processing program or a text editor on the text file. You can also use the solution in nonnumeric characters.
Much of the formatting in Excel interferes with Stata's ability to interpret the data reasonably. Just before saving the data as a text file, make sure that all formatting is turned off, at least temporarily. You can do this by highlighting the entire spreadsheet, selecting Format, then selecting Cells, and clicking General.
However, no solution solves all problems. Here is a cautionary tale. A text file included address data. One column included house numbers, and a few were in the form 4/2. Excel decided these few were dates and converted them to 4th February. Setting all cells to a General format does not help because it converts these unwanted dates to 5 digit Excel date codes. One solution is to apply a Text format to the offending column when offered the option during Excel's text import process. But even this works only if you have manageably few columns to look through and are aware of the possibility of the problem.
Stata limits variable names to 32 characters and does not allow any characters that it uses as operators or delimiters. Also, variable names should start with a letter. People who are Excel users first and Stata users second are often creative with the names they apply to columns. Stata converts illegal column (field) names to labels and makes a best guess at a sensible variable name. Stata's best guess, however, may not be as good as the name a user would choose knowing Stata's naming restrictions.
For example, Stata will make variable names using the first 32 characters of the variable name and use the rest for a label. If the first 32 characters are not unique, subsequent occurrences will be called var1, var2, etc., or v1, v2, etc. (If you paste the data, the variable stub is var; if you use insheet, the stub is v, so be careful writing do-files.)
Stata completely ignores empty rows in a spreadsheet but not completely empty columns. A completely empty column gets read in as a variable with missing values for every observation. Of course, no harm is done in either case, but spreadsheet users who wish to add blank columns and/or rows to enhance legibility may wish to note this difference.
It is best if the first row of data is complete with no missing data. If necessary, add a dummy row with every value present, and then once in Stata type
. drop in 1
The missings command by Nicholas J. Cox, which allows variables or observations that are all missing to be easily dropped, was published in Stata Journal 15(4). Type search dm0085 for information on this command.
With integer-like codes, such as ICD-9 codes or U.S. Social Security numbers, that do not contain a dash, leading zeros will get dropped when pasted into Stata from Excel. One solution is to flag the variable as a string in the first line: add a nonnumeric character in Excel on that line, and then remove it in Stata.
The missing leading zeros can also be replaced in a conversion to string with one Stata command line; for example,
. gen svar = string(nvar, "%012.0f")
The second argument on the right-hand side of this command is a format specifying leading zeros on conversion of nvar to its string equivalent. For more details on formats, see format.
Confirm the filename and location of the file you are trying to read. Use Explorer or its equivalent to check.
For example, you may have inadvertently produced a file named filename.txt.txt, or more generally, a name that ends with two extensions, which may or may not be the same extension. This naming is possible if you have an option checked in Windows Explorer under View, Folder Options to hide file extensions for known or registered file types. Manually rename the file, or use the correct filename in your Stata command. You may also wish to uncheck this option to avoid similar mistakes in the future.