The following material is based on postings to Statalist.

How can I save one or more parts of a large dataset?

Title   Saving one or more parts of a dataset
Authors Paul Seed, Wolfson Institute of Preventive Medicine, London
        Nicholas J. Cox, Durham University, UK
        Jean Marie Linhart, StataCorp
Date    May 2002; updated February 2003

The save command does not allow specification either of a varlist, which would be used to specify a subset of variables, or of if or in conditions, which would be used to specify a subset of observations. Extending save in these directions is on the StataCorp to-do list, but, at present, these limitations need to be bypassed in some other way.

We assume the main dataset has previously been saved to a Stata data file in binary format (a .dta file). If not, you should save the data first:

        . save main

1 Use keep or drop first

The first way to save part of a large dataset is to use keep or drop first.

  1. If you wish to save only some variables, then first keep those variables (or if you find it easier, drop variables you do not want).
  2. If you wish to save only some observations, then keep those observations (or if you find it easier, drop observations you do not want).
  3. Now save the data in memory:
            . save part
  4. If desired, read in the main dataset once more:
            . use main

    and repeat for a different part of the data.
  5. preserve and restore allow a broadly similar method: preserve the data, keep the part you want, save it, and then restore the original. For saving part of a dataset, this approach is wrapped up in the user-written savesome program on SSC. Use the ssc command to describe and, if you want, to install this program:
            . ssc describe savesome
            . ssc install savesome 
    
    For alternatives to ssc, see help findit.
  6. If you need many relatively simple divisions of the main dataset into several parts, you can greatly reduce your typing by using foreach or forvalues.

    Suppose that you wanted to divide a dataset into 7 “part” datasets depending on the values 1 to 7 of a classifying variable group. That is, all observations with group equal to 1 will go in the first part dataset, and so forth. Here are two ways of doing that, one with foreach and one with forvalues, both of which can be used interactively:
      use main
      preserve
      foreach i of num 1/7 {
              keep if group == `i'
              save group`i'
              restore, preserve
      }

      use main
      preserve
      forval i = 1/7 {
              keep if group == `i'
              save group`i'
              restore, preserve
      }

    
  7. Extra note: You may want to repeat this process if your main dataset has changed, if your first attempt did not create the files you wanted, and so on. If so, you will find that save does not automatically overwrite existing datasets, which is a useful protection against accidental loss of data. You can get around this by specifying the replace option of save.
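
As a concrete illustration of steps 1 to 4 and of the replace option, here is a minimal sketch. It is not part of the original postings: it assumes that the auto dataset shipped with Stata stands in for the main dataset, and the file names main_demo and part_demo are made up for the example.

```stata
* Sketch only: the shipped auto dataset stands in for the main dataset,
* and the file names main_demo and part_demo are hypothetical.
sysuse auto, clear
save main_demo, replace

* Steps 1 and 2: keep the wanted variables and observations.
keep make price mpg foreign
keep if foreign == 1

* Step 3: save the part; replace lets these commands be rerun.
save part_demo, replace

* Step 4: read in the main dataset again before cutting another part.
use main_demo, clear
```

Because every save here specifies replace, the commands can be rerun after the main dataset changes. The cost is that existing files of the same name are overwritten without warning.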

2 A different, more concise way

While save has these limitations, use does not. You can, in fact, split a large dataset without ever loading it in its entirety.

Suppose again that you wanted to divide a dataset into 7 part datasets depending on the values 1 to 7 of a classifying variable group. Here are two other ways of doing that:

      foreach i of num 1/7 {
              use main if group == `i', clear
              save group`i'
      }

      forval i = 1/7 {
              use main if group == `i', clear
              save group`i'
      }

This approach can be adapted to other similar problems. In particular, you can also specify a varlist with use, so that only some of the variables are read in.
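
To show a varlist and an if condition combined in one use command, here is a minimal sketch. It follows the form documented in help use, with the varlist and condition placed before using. The auto dataset and the file names main_auto and part_auto are assumptions made for the example, not part of the original postings.

```stata
* Sketch only: make a stand-in "main" file from the shipped auto dataset.
sysuse auto, clear
save main_auto, replace

* Read just three variables, and only the observations with foreign == 1.
* foreign is listed in the varlist so that the condition can refer to it.
use make price foreign if foreign == 1 using main_auto, clear

* Save the part; replace allows the sketch to be rerun.
save part_auto, replace
```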

3 Which way is faster?

It is natural to wonder which method is faster. This question is, however, difficult to answer because it depends on the size of a dataset, how much memory you have available, whether you are working over a network, the platform you are on, and so forth.

With method 1, it is possible that the main dataset is held in memory rather than being written out to disk each time, if the operating system is smart enough to do that and enough memory is available. But as far as Stata is concerned, the data are written to disk. Method 2 keeps the data on disk and requires disk access on every pass.

That said, various experiments with Stata for Linux, for Macintosh, and for Windows indicate that method 2 is generally faster.
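
Because the answer depends so much on your own setup, it may be worth timing both methods yourself. The sketch below is one way to do that with Stata's timer command. It is an illustration, not the experiment reported above: the synthetic dataset, its size, the variable x, and the file names timing_main and timing_group are all made up for the example.

```stata
* Sketch only: build a synthetic main dataset with groups 1 to 7.
clear
set obs 70000
generate group = mod(_n, 7) + 1
generate x = runiform()
save timing_main, replace

timer clear

* Method 1: preserve, keep, save, restore.
timer on 1
use timing_main, clear
preserve
forval i = 1/7 {
        keep if group == `i'
        quietly save timing_group`i', replace
        restore, preserve
}
restore
timer off 1

* Method 2: read each part directly with use ... if.
timer on 2
forval i = 1/7 {
        use if group == `i' using timing_main, clear
        quietly save timing_group`i', replace
}
timer off 2

* Display the elapsed time, in seconds, for timers 1 and 2.
timer list
```

A single run on a small dataset says little; repeating the comparison with a dataset of realistic size gives a better picture.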
