The following material is based on postings to Statalist.

How can I save one or more parts of a large dataset?

Title   Saving one or more parts of a dataset
Author  Paul Seed, Wolfson Institute of Preventive Medicine, London
        Nicholas J. Cox, Durham University, UK
        Jean Marie Linhart, StataCorp
Date    May 2002; updated February 2003

The save command does not accept a varlist, which would select a subset of variables, or if or in conditions, which would select a subset of observations. Extending save in these directions is on the StataCorp to-do list, but at present these limitations must be worked around in some other way.

We assume that the main dataset has previously been saved to a Stata data file in binary format (a .dta file). If not, you should save the data first:

        . save main

1 Use keep or drop first

The first way to save part of a large dataset is to use keep or drop first.

  1. If you wish to save only some variables, then first keep those variables (or if you find it easier, drop variables you do not want).
  2. If you wish to save only some observations, then keep those observations (or if you find it easier, drop observations you do not want).
  3. Now save the data in memory
            . save part 
  4. If desired, read in the main dataset once more
            . use main
    
    and repeat for a different part of the data.
  5. preserve and restore allow a broadly similar method: preserve the data, keep or drop as needed, save the part, and then restore the original. This approach is wrapped up in the user-written savesome program on SSC. Use the ssc command to describe and, if you want, to install this program:
            . ssc describe savesome
            . ssc install savesome 
    
    For alternatives to ssc, see help findit.
  6. If you are dividing the main dataset into several parts in a relatively simple way, using foreach or forvalues will save you much typing.

    Suppose that you wanted to divide a dataset into 7 “part” datasets depending on the values 1 to 7 of a classifying variable group. That is, all observations with group equal to 1 will go in the first part dataset, and so forth. Here are two ways of doing that, both of which can be used interactively:
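      * Alternative 1: foreach over a numlist
      use main
      preserve
      foreach i of numlist 1/7 {
              keep if group == `i'
              save group`i'
              restore, preserve
      }

      * Alternative 2: forvalues (use instead of, not after, alternative 1)
      use main
      preserve
      forvalues i = 1/7 {
              keep if group == `i'
              save group`i'
              restore, preserve
      }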
    
  7. Extra note: You may want to repeat this process; for example, if your main dataset has changed, or if your first attempt did not create the files you wanted. If so, you will find that save refuses to overwrite an existing dataset, which is a useful protection against accidental loss of data. You can override this by specifying the replace option of save.
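Steps 1 to 4, together with the replace option mentioned in the extra note, might look like the following sketch. It assumes that the auto dataset shipped with Stata stands in for the main dataset; the file names main and part and the particular variables and condition are illustrative only.

```stata
* Assumption: the auto dataset shipped with Stata stands in for "main"
sysuse auto, clear
save main, replace

* Step 1: keep only the variables wanted
keep make price mpg foreign

* Step 2: keep only the observations wanted
keep if foreign == 1

* Step 3: save the part; replace overwrites any earlier copy of part.dta
save part, replace

* Step 4: read in the main dataset again, ready for the next part
use main, clear
```

Because replace is specified on each save, the sketch can be rerun without first deleting the files it creates.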

2 A different, more concise way

Another approach is to note that although save has these limitations, use does not: it accepts a varlist and if or in conditions. You can, in fact, split a large dataset without ever loading it in its entirety.

Suppose again that you wanted to divide a dataset into 7 part datasets depending on the values 1 to 7 of a classifying variable group. Here are two other ways of doing that:

      foreach i of numlist 1/7 {
              use main if group == `i', clear
              save group`i'
      }

      forvalues i = 1/7 {
              use main if group == `i', clear
              save group`i'
      }

This approach can be adapted to other similar problems. In particular, you can also specify a varlist with use, so that only the variables you need are read from disk.
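As a sketch of combining a varlist with an if condition in use, the following first loads only the classifying variable to find the values it takes and then reads each part directly from disk. It assumes that the auto dataset shipped with Stata stands in for main and that rep78 plays the role of group; the output names part1 to part5 are hypothetical.

```stata
* Assumption: auto stands in for "main"; rep78 (values 1 to 5) plays "group"
sysuse auto, clear
save main, replace

* Load only the classifying variable and collect its distinct values
use rep78 using main, clear
levelsof rep78, local(levels)

* Read each part straight from disk, keeping just three variables
foreach i of local levels {
    use make price rep78 if rep78 == `i' using main, clear
    save part`i', replace
}
```

Looping over the result of levelsof rather than a fixed range means that no file is created for a value that never occurs. By default levelsof ignores missing values, so observations with a missing classifying variable are not written to any part.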

3 Which way is faster?

It is natural to wonder which method is faster. This question is, however, difficult to answer, as it depends on the size of a dataset, how much memory you have available, whether you are working over a network, the platform you are on, and so forth.

With method 1, the operating system may hold the main dataset in memory rather than putting it out to disk each time, if it is smart enough to do that and enough memory is available. As far as Stata is concerned, however, the data are put out to disk. Method 2 leaves the data on disk and requires disk access for every part.

That said, various experiments with Stata for Linux, for Macintosh, and for Windows indicate that method 2 is generally faster.
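To check which method is faster on your own system, a rough timing sketch using Stata's timer command could look like this. It assumes that the auto dataset shipped with Stata stands in for main and that rep78 is the classifying variable; the m1_ and m2_ file names are hypothetical. On so small a file the timings mean little, so substitute your own dataset.

```stata
* Assumption: auto stands in for "main"; rep78 (values 1 to 5) plays "group"
sysuse auto, clear
save main, replace

timer clear

* Method 1: keep, save, then bring back the preserved copy
timer on 1
use main, clear
preserve
forvalues i = 1/5 {
    keep if rep78 == `i'
    save m1_part`i', replace
    restore, preserve
}
restore
timer off 1

* Method 2: read each part directly from disk
timer on 2
forvalues i = 1/5 {
    use main if rep78 == `i', clear
    save m2_part`i', replace
}
timer off 2

* Report the elapsed time recorded by timers 1 and 2
timer list
```

Running the whole sketch several times and comparing the two timers gives a fairer picture than a single run, because disk caching can favor whichever method runs second.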

© Copyright 1996–2008 StataCorp LP