How do you process very large datasets in Stata?
|
Title
|
|
Dealing with very large datasets
|
|
Author
|
Kolver Hernandez, Boston College
|
|
Date
|
January 2002; updated July 2011
|
Use Stata/MP or Stata/SE.
If you do not have Stata/MP or Stata/SE, please continue with this FAQ.
When the number of variables in a dataset to be analyzed with Stata is
larger than 2,047 (likely with large surveys), the dataset is divided into
several segments, each saved as a Stata dataset (.dta file). To work with
information contained in two or more .dta files, it is necessary to merge
the segments into a new single file that must not contain more than 2,047
variables. Here is a list of steps to construct a new database with
information merged from different files. Any manipulation of the data made
with a Stata do-file allows you to review and/or repeat the procedure more
easily.
- Review the
codebook or
list
of variables, and determine what information is needed and which files
contain the desired variables.
- Read into Stata the first file or segment:
. use filename.dta
A unique ID for each case (observation) must be provided for each
file to be merged. Typically, the ID for a time-series database is the date of
the observation. For a cross-section database, it is the ID of the
cross-section unit (family identifier, firm CUSIP, etc.), and in panel data,
two characteristics are needed to identify each observation: date and ID.
However, for panel data, sometimes a “case ID” is provided to
facilitate merging.
You must ensure that the form in which the unique ID is held in each
file must match; i.e., you cannot match a str8 (8-character string) to a
str6 ID, nor can you match a string to an integer. Use Stata's
describe command to ensure that the name and data
type of the ID variable are the same in all files.
- Discard the variables that are not needed (keeping the case ID); this
can be done in at least two ways. Wildcards (*) and hyphens (-) may be
used in the varlists; see help
varlist for their use.
- If the useful variables can be listed more easily:
. keep caseID varname1 varname2 ... varnameN
- If the unwanted variables can be listed more easily:
. drop varname1 varname2 ... varnameN
Remember that the case ID must be part of the new file.
- Verify that only the desired variables are in memory:
. describe
- Sort the data by case ID:
. sort caseID
- Save the sorted data currently in memory with a different name:
. save newfile#.dta
- Repeat steps 2–5 for all files containing the desired variables.
Finally, you will end up with a set of new files (newfile1.dta,
newfile2.dta, .... newfileJ.dta) to be merged into a new
dataset. Now you are ready to merge the data.
The merge
command merges corresponding observations from the dataset currently in
memory (called the master dataset) with those from a different
Stata-format dataset (called the using dataset) into single
observations. A new variable _merge is created for informative
purposes (described below). Both files must be previously sorted by the
merge variable(s), e.g., case ID.
- Merge the first two new files.
- Read the master dataset (newfile1.dta recently created):
. use newfile1.dta, clear
- Merge the data with the using dataset (newfile2.dta):
. merge 1:1 caseID using newfile2.dta
- Tabulate _merge:
. tabulate _merge
-
The variable _merge is created automatically, and it takes the
following values:
-
-
_merge==1 if the observation was taken from the master data only
_merge==2 if the observation was taken from the using data only
_merge==3 if the observation match both master and using data
-
You can use the tabulated information to check if the data were merged as
desired.
- Drop the _merge variable:
. drop _merge
- If there are more than two files to merge, use the current data in
memory as the master dataset, and repeat steps 8b-8d for each file to be
merged (newfile3.dta, newfile4.dta, ....
newfileJ.dta).
- Save the new dataset:
. save newdataset.dta
Sample program
Here is an example of how a do-file can be used to merge data contained in
three hypothetical segments.
- Variables to merge: X11, X12, X13, X21, X22, X23, X31, X32, and X33
- Segments containing these variables: segment1.dta, segment2.dta, and
segment3.dta
- Identifier: ID (the variable ID, contained in each of the three
segments)
This do-file merges some variables from the .dta files: segment1.dta,
segment2.dta, and segment3.dta into a new file named newdatabase.dta. This
do-file will be documented in the log-file logmerge.smcl for further
reference.
/* open the log file*/
log using logmerge.smcl, replace
/*Generates three .dta files containing only desired variables*/
use segment1.dta, clear
keep ID X11 X12 X13
sort ID
save newfile1.dta, replace
use segment2.dta, clear
keep ID X21 X22 X23
sort ID
save newfile2.dta, replace
use segment2.dta, clear
keep ID X31 X32 X33
sort ID
save newfile3.dta, replace
clear
/*Merges the three new files generated*/
use newfile1.dta, clear
merge 1:1 ID using newfile2.dta
tab _merge /*check the file logmerge.smcl to verify that _merge
takes the appropriate value*/
drop _merge
/* if _merge is not dropped an error will be generated,
up to this point two segments were merged successfully*/
merge 1:1 ID using newfile3.dta
tab _merge /*check the file logmerge.smcl to
verify that _merge takes the appropriate value*/
drop _merge
/* the three segments were merged successfully, now save the final new
> database*/
save newdatabase.dta
clear
log close
/* The individual segment files may now be discarded or compressed if
> desired */
Acknowledgment
Thanks to Petia Petrova for contributions to this document.
|
FAQs
What's new?
Statistics
Data management
Graphics
Programming Stata
Mata
Resources
Internet capabilities
Stata for Windows
Stata for Unix
Stata for Mac
Technical support
|