The following material is based on postings
to Statalist.
How can I save one or more parts of a large dataset?
Title: Saving one or more parts of a dataset
Authors: Paul Seed, Wolfson Institute of Preventive Medicine, London
Nicholas J. Cox, Durham University, UK
Jean Marie Linhart, StataCorp
Date: May 2002; updated February 2003
The save command does not accept a varlist, which would specify a subset of
variables, or if or in conditions, which would specify a subset of
observations. Extending save in these directions is on the StataCorp to-do
list, but at present these limitations need to be bypassed in some other way.
We assume the main dataset has previously been saved to a Stata data
file in binary format (a .dta file). If not, you should save
the data first:
. save main
1 Use keep or drop first
The first way to save part of a large dataset is to use keep or drop first.
- If you wish to save only some variables, then first keep
those variables (or if you find it easier, drop variables you do
not want).
- If you wish to save only some observations, then keep those
observations (or if you find it easier, drop observations you do
not want).
- Now save the data in memory:
. save part
- If desired, read in the main dataset once more
. use main
and repeat for a different part of the data.
- preserve and restore allow a broadly similar method. For saving part of a
dataset, the preserve and restore approach is wrapped up in the user-written
savesome program on SSC. Use the ssc command to describe and, if you want, to
install this program:
. ssc describe savesome
. ssc install savesome
For alternatives to ssc, see help findit.
- With many relatively simple divisions of the main dataset into several
parts, you can reduce your typing considerably by using foreach or forvalues.
Suppose that you wanted to divide a dataset into 7 “part” datasets depending
on the values 1 to 7 of a classifying variable group. That is, all
observations with group equal to 1 will go in the first part dataset, and so
forth. Here are two ways of doing that, both of which can be used
interactively:
* Version with foreach
use main
preserve
foreach i of num 1/7 {
    keep if group == `i'
    save group`i'
    restore, preserve
}

* Version with forvalues
use main
preserve
forval i = 1/7 {
    keep if group == `i'
    save group`i'
    restore, preserve
}
- Extra note: You may want to repeat this process if your main dataset has
changed, if your first attempt did not create the files you wanted, and so
on. If so, you will find that save does not automatically overwrite existing
datasets, which is a useful protection against accidental loss of data. You
can override this by specifying the replace option of save.
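As a concrete illustration of the keep-then-save steps and of the replace option, here is a minimal sketch. It assumes the auto dataset shipped with Stata as a stand-in for the main dataset; the file name cheap, the variables kept, and the price cutoff are hypothetical choices made only for the example.

```stata
* Illustration only: the shipped auto dataset stands in for the main dataset
sysuse auto, clear
save main, replace

* Keep a subset of variables, then a subset of observations
keep make price mpg
keep if price < 5000

* replace lets this be rerun without an error if cheap.dta already exists
save cheap, replace

* Read the main dataset back in before cutting out another part
use main, clear
```

Because every save specifies replace, the whole sequence can be rerun after the main dataset changes. That also means any existing main.dta or cheap.dta in the working directory is overwritten, so try it in a scratch directory.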
2 A different, more concise way
While save has these limitations, use does not. You can, in fact, split a
large dataset without ever loading it in its entirety. Suppose again that you
wanted to divide a dataset into 7 part datasets depending on the values 1 to 7
of a classifying variable group. Here are two other ways of doing that:
foreach i of num 1/7 {
    use main if group == `i', clear
    save group`i'
}

forval i = 1/7 {
    use main if group == `i', clear
    save group`i'
}
This approach can be adapted to other similar problems. In particular, you
can also specify a varlist with use.
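To show a varlist and an if condition combined in a single use command, here is a minimal sketch. It again assumes the auto dataset shipped with Stata as a stand-in for the main dataset; the file name foreign_part and the variables chosen are hypothetical. Note that once a varlist is given, the filename must be introduced by using.

```stata
* Illustration only: the shipped auto dataset stands in for the main dataset
sysuse auto, clear
save main, replace

* Load only four variables, and only the foreign cars, then save that part
use make price mpg foreign if foreign == 1 using main, clear
save foreign_part, replace
```

The variable tested in the if condition is included in the varlist here to keep the sketch on safe ground; it can be dropped before saving if it is not wanted in the part dataset.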
3 Which way is faster?
It is natural to wonder which method is faster. This question is, however,
difficult to answer because it depends on the size of the dataset, how much
memory you have available, whether you are working over a network, the
platform you are on, and so forth.
With method 1, it is possible that the main dataset is held in memory without
being written to disk each time, if the operating system is smart enough to
do that and enough memory is available. As far as Stata is concerned, however,
it is written to disk. Method 2 keeps the data on disk and requires disk
access.
That said, various experiments with Stata for Linux, for Macintosh, and for
Windows indicate that method 2 is generally faster.
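If you would rather check on your own system, the two methods can be timed with Stata's timer command. The following is a minimal sketch, not a benchmark. It assumes the auto dataset shipped with Stata as a stand-in for the main dataset, with rep78 (values 1 to 5) standing in for the classifying variable group; the file names part1 to part5 are hypothetical.

```stata
* Illustration only: the shipped auto dataset stands in for the main dataset
sysuse auto, clear
save main, replace

timer clear

* Method 1: keep, save, and restore inside a loop
timer on 1
use main, clear
preserve
forvalues i = 1/5 {
    keep if rep78 == `i'
    save part`i', replace
    restore, preserve
}
* Cancel the outstanding preserve; the data in memory are already restored
restore, not
timer off 1

* Method 2: read each part directly from disk
timer on 2
forvalues i = 1/5 {
    use main if rep78 == `i', clear
    save part`i', replace
}
timer off 2

* Report elapsed seconds for timers 1 and 2
timer list
```

The auto dataset is far too small to show a meaningful difference. Substitute your own main dataset and classifying variable, and repeat the runs several times, since disk caching can affect the first pass.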