Stata | FAQ: Saving one or more parts of a dataset


The following material is based on postings to Statalist.

How can I save one or more parts of a large dataset?

Title		Saving one or more parts of a dataset
Author		Paul Seed, Wolfson Institute of Preventive Medicine, London
		Nicholas J. Cox, Durham University, UK
		Jean Marie Linhart, StataCorp

The save command does not allow specification either of a varlist, which would be used to specify a subset of variables, or of if or in conditions, which would be used to specify a subset of observations.

We assume the main dataset has previously been saved to a Stata data file in binary format (a .dta file). If not, you should save the data first:

```
        . save main
```

1 Use keep or drop first

The first way to save part of a large dataset is to use keep or drop first.

If you wish to save only some variables, then first keep those variables (or if you find it easier, drop variables you do not want).
If you wish to save only some observations, then keep those observations (or if you find it easier, drop observations you do not want).
Now save the data in memory:
```
        . save part 
```
If desired, read in the main dataset once more:
```
        . use main
```
and repeat for a different part of the data.
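The steps above can be sketched as a short do-file. This is only an illustration: the variable names id, group, and income and the condition group == 1 are hypothetical, and replace is added so that the sketch can be rerun.

```stata
* Load the full dataset (assumes main.dta already exists)
use main, clear

* Keep only the variables of interest (hypothetical names)
keep id group income

* Keep only the observations of interest
keep if group == 1

* Save the reduced dataset; replace overwrites any earlier part.dta
save part, replace

* Read the main dataset back in before creating another part
use main, clear
```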
preserve and restore allow a broadly similar method: preserve the data, keep the part you want, save it, and then restore the full dataset. This approach is wrapped up in the community-contributed savesome program on SSC. Use the ssc command to describe and, if you want, to install this program:
```
        . ssc describe savesome
        . ssc install savesome 
```
For alternatives to ssc, see help search.
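For readers who prefer not to install anything, the preserve and restore method mentioned above can be written out by hand. The following is a minimal sketch that uses only official commands; the condition group == 1 and the filename part1 are assumptions made for illustration.

```stata
* Load the full dataset (assumes main.dta already exists)
use main, clear

* Set aside a copy of the data currently in memory
preserve

* Reduce the data in memory to the part wanted
keep if group == 1

* Save that part under a hypothetical filename
save part1, replace

* Bring back the full dataset exactly as it was before preserve
restore
```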
If you need many relatively simple divisions of the main dataset into several parts, you can greatly reduce your typing by using foreach or forvalues.

Suppose that you wanted to divide a dataset into 7 “part” datasets depending on the values 1 to 7 of a classifying variable group. That is, all observations with group equal to 1 will go in the first part dataset, and so forth. Here are two ways of doing that, both of which can be used interactively:
```
 use main 
 preserve 
 foreach i of num 1/7 {
         keep if group == `i'
         save group`i'
         restore, preserve 
 }
```
```
 use main 
 preserve 
 forval i = 1/7 {
         keep if group == `i'
         save group`i'
         restore, preserve 
 }
```
Extra note: You may want to repeat this process if your main dataset has changed, if your first attempt did not create the files you wanted, and so on. If so, you will find that save does not automatically overwrite existing files, which is a useful protection against accidental loss of data. You can get around this by specifying the replace option of save.
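As a sketch of the replace option in use, here is the forvalues loop from above written so that it can be rerun safely. It assumes, as before, that main.dta contains a variable group taking the values 1 to 7.

```stata
use main, clear
preserve
forvalues i = 1/7 {
        keep if group == `i'
        * replace overwrites group`i'.dta if it already exists
        save group`i', replace
        restore, preserve
}
```

Use replace only when you are sure the existing files can be discarded.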

2 A different, more concise way

While save has these limitations, use does not. You can, in fact, split a large dataset without ever loading it in its entirety.

Suppose again that you wanted to divide a dataset into 7 part datasets depending on the values 1 to 7 of a classifying variable group. Here are two other ways of doing that:

```
 foreach i of num 1/7 {
         use main if group == `i', clear 
         save group`i'
 }
```
```
 forval i = 1/7 {
         use main if group == `i', clear 
         save group`i'
 }
```

This approach can be adapted to other similar problems. In particular, you can also specify a varlist with use.
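As an illustration of combining a varlist with an if condition, the sketch below reads only three variables for each group. The variable names id and income are hypothetical, and the output filenames are chosen only for this example. Note that when a varlist or condition precedes the filename in this way, the filename is introduced by using.

```stata
forvalues i = 1/7 {
        * Read only the listed variables and only the observations in group `i'
        use id group income if group == `i' using main, clear
        * Save the subset under a hypothetical filename
        save group`i'_small, replace
}
```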

3 Which way is faster?

It is natural to wonder which method is faster. This question is, however, difficult to answer because it depends on the size of a dataset, how much memory you have available, whether you are working over a network, the platform you are on, and so forth.

It is possible with method 1 that the main dataset is held in memory without being written to disk each time, if the operating system is smart enough to do that and enough memory is available. But as far as Stata is concerned, it is written to disk. Method 2 keeps the data on disk and requires disk access each time.

That said, various experiments with Stata for Linux, for Macintosh, and for Windows indicate that method 2 is generally faster.
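If you want to check this on your own system, one possibility is to time both methods with Stata's timer command. The following is a rough sketch rather than a careful benchmark: it assumes main.dta with a variable group taking the values 1 to 7, and it overwrites the group files on each run.

```stata
timer clear

* Method 1: preserve and restore
timer on 1
use main, clear
preserve
forvalues i = 1/7 {
        keep if group == `i'
        save group`i', replace
        restore, preserve
}
* Discard the preserved copy left over after the final iteration
restore, not
timer off 1

* Method 2: use with an if condition
timer on 2
forvalues i = 1/7 {
        use main if group == `i', clear
        save group`i', replace
}
timer off 2

* Report elapsed seconds for timers 1 and 2
timer list
```

Results from a single run can be affected by disk caching, so it is worth repeating the comparison several times and in both orders.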
