How to open part of a dataset

Sometimes we want to load only part of a dataset into memory: a subset of variables, a subset of observations, or both. First, we copy the example dataset to the current working directory by typing

. copy https://www.stata-press.com/data/r18/nhanes2l.dta nhanes2l.dta

Next, we can type describe using to view information about the contents of the dataset without opening the datafile.

. describe using nhanes2l.dta

Contains data                                 Second National Health and
                                                Nutrition Examination Survey
 Observations:        10,351                  23 Mar 2023 10:43
    Variables:            42



Variable      Storage   Display    Value                                        
    name         type    format    label      Variable label                    

sampl           long    %9.0g                 Unique case identifier
strata          byte    %9.0g                 Stratum identifier
psu             byte    %9.0g      psulbl     Primary sampling unit
region          byte    %9.0g      region     Region
smsa            byte    %22.0g     smsalbl    SMSA type
location        byte    %9.0g                 Location (stand office ID)
houssiz         byte    %9.0g                 Number of people in household
sex             byte    %9.0g      sex        Sex
race            byte    %9.0g      race       Race
age             byte    %9.0g                 Age (years)
height          float   %9.0g                 Height (cm)
weight          float   %9.0g                 Weight (kg)
bpsystol        int     %9.0g                 Systolic blood pressure
bpdiast         int     %9.0g                 Diastolic blood pressure
tcresult        int     %9.0g                 Serum cholesterol (mg/dL)
tgresult        int     %9.0g                 Serum triglycerides (mg/dL)
hdresult        int     %9.0g                 High-density lipids (mg/dL)
hgb             float   %9.0g                 Hemoglobin (g/dL)
hct             float   %9.0g                 Hematocrit (%)
tibc            int     %9.0g                 Total iron bind. cap. (mcg/dL)
iron            int     %9.0g                 Serum iron (mcg/dL)
hlthstat        byte    %20.0g     hlth       Health status
heartatk        byte    %16.0g     heartlbl   Prior heart attack
diabetes        byte    %12.0g     diabetes   Diabetes status
sizplace        byte    %39.0g     size       Size of place
finalwgt        long    %9.0g                 Sampling weight (except lead)
leadwt          long    %9.0g                 Sampling weight for lead
corpuscl        float   %9.0g                 Mean corpuscular volume (fL)
trnsfern        float   %9.0g                 Transferrin saturation (%)
albumin         float   %9.0g                 Serum albumin (g/dL)
vitaminc        float   %9.0g                 Serum vitamin C (mg/dL)
zinc            int     %9.0g                 Serum zinc (mcg/dL)
copper          int     %9.0g                 Serum copper (mcg/dL)
porphyrn        int     %9.0g                 Erythrocyte porphyrin (mcg/dl)
lead            byte    %9.0g                 Lead (mcg/dL)
hsizgp          byte    %8.0g                 # in household or 5 if #>=5
rural           byte    %8.0g      rurallbl   Rural
loglead         float   %9.0g                 log(lead)
agegrp          byte    %8.0g      agegrp     Age group
highlead        byte    %10.0g     highlead   High lead level
bmi             float   %9.0g                 Body mass index (BMI)
highbp          byte    %8.0g                 High blood pressure

Sorted by:
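
When only a quick summary or a few variables are of interest, describe using also accepts a variable list and the short option. A minimal sketch, with the output omitted here:

. * general information about the file only
. describe using nhanes2l.dta, short
. * only the three named variables
. describe diabetes agegrp bmi using nhanes2l.dta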

The dataset contains 10,351 observations and 42 variables. Let's say we are interested only in the variables diabetes, agegrp, and bmi. We can include those variable names in our use command, and Stata will load only those variables into memory.

. use diabetes agegrp bmi using nhanes2l
(Second National Health and Nutrition Examination Survey)

We can type describe to view the contents of the data in memory.

. describe

Contains data from nhanes2l.dta
 Observations:        10,351                  Second National Health and
                                                Nutrition Examination Survey
    Variables:             3                  23 Mar 2023 10:43


Variable      Storage   Display    Value                                        
    name         type    format    label      Variable label                    

diabetes        byte    %12.0g     diabetes   Diabetes status
agegrp          byte    %8.0g      agegrp     Age group
bmi             float   %9.0g                 Body mass index (BMI)

Sorted by:

There are 10,351 observations for the variables we requested: diabetes, agegrp, and bmi. Note that the other variables are still present in the dataset in the file, but they are not loaded into Stata's memory.
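
If the data in memory have changed since they were last saved, use will refuse to replace them unless we add the clear option. A minimal sketch of the same command with clear follows; because clear discards whatever is currently in memory, it should be used only when those changes are not needed.

. * discard the data in memory and load the three variables again
. use diabetes agegrp bmi using nhanes2l, clear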

We can also load a subset of observations from the dataset. Perhaps we want only the first 1,000 observations in the dataset. We can do this with the in qualifier.

. use diabetes agegrp bmi using nhanes2l in 1/1000
(Second National Health and Nutrition Examination Survey)

We can type describe and see that the dataset in memory includes 1,000 observations for the variables diabetes, agegrp, and bmi.

. describe

Contains data from nhanes2l.dta
 Observations:         1,000                  Second National Health and
                                                Nutrition Examination Survey
    Variables:             3                  23 Mar 2023 10:43


Variable      Storage   Display    Value                                        
    name         type    format    label      Variable label                    

diabetes        byte    %12.0g     diabetes   Diabetes status
agegrp          byte    %8.0g      agegrp     Age group
bmi             float   %9.0g                 Body mass index (BMI)

Sorted by:
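
The in qualifier accepts any range of observation numbers, not only one that starts at the first observation. As a sketch, the following would load observations 1,001 through 2,000 and then count them to confirm that 1,000 observations are in memory. The clear option is added here in case the data in memory have changed.

. * load observations 1,001 through 2,000
. use diabetes agegrp bmi using nhanes2l in 1001/2000, clear
. * report the number of observations in memory
. count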

Sometimes, we may wish to restrict the observations based on the values of a variable in the dataset. For example, we may be interested only in observations from the Northeastern region of the United States. We can begin by loading only the variable region.

. use region using nhanes2l.dta
(Second National Health and Nutrition Examination Survey)

Next, we can tabulate the variable region with and without the value labels.

. tabulate region


     Region        Freq.     Percent        Cum.
   
         NE        2,096       20.25       20.25
         MW        2,774       26.80       47.05
          S        2,853       27.56       74.61
          W        2,628       25.39      100.00
   
      Total       10,351      100.00            


. tabulate region, nolabel


     Region        Freq.     Percent        Cum.
   
          1        2,096       20.25       20.25
          2        2,774       26.80       47.05
          3        2,853       27.56       74.61
          4        2,628       25.39      100.00
   
      Total       10,351      100.00

The Northeastern region of the United States (NE) corresponds to region==1. So, we can open the dataset using only the observations for region 1 by adding the qualifier if region==1.

. use region diabetes agegrp bmi using nhanes2l if region==1
(Second National Health and Nutrition Examination Survey)

We can type tabulate region to verify that the dataset in memory includes only observations from region 1.
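
A minimal sketch of that check, with the output omitted here. Given the frequencies tabulated above, we would expect tabulate to list a single category, NE, and count to report 2,096 observations.

. * should list only the NE category
. tabulate region
. * should report 2,096 observations
. count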

Don't forget that the dataset in the file still contains all the original data. But the dataset in Stata's memory includes only the variables and observations we specified with our use command. If you save the dataset in memory, you will save only the variables and observations in memory, and you will lose all other data in the original datafile. Be sure to save your partial dataset with a new name to avoid losing data.

. save nhanes2l_partial.dta
file nhanes2l_partial.dta saved
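
To confirm what was written, we can describe the new file without loading it, just as we did for the original. If the commands are run again later, save will refuse to overwrite nhanes2l_partial.dta unless the replace option is added. A sketch of both steps:

. * inspect the saved file without opening it
. describe using nhanes2l_partial.dta
. * overwrite the partial file on a later run
. save nhanes2l_partial.dta, replace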

You can watch a demonstration of these commands in the YouTube video listed below. You can read more about these commands in the Stata manual entries for use, describe, and save.

See it in action

Watch Load a subset of data from a Stata dataset.
