Stata
Products Purchase Support Company
Search
   >> Home >> Resources & support >> FAQs >> Selecting a subset of observations with a complicated criterion
The following material is based on postings to Statalist.

How do I select a subset of observations using a complicated criterion?

Title   Selecting a subset of observations with a complicated criterion
Author Nicholas J. Cox, Durham University, UK
Date July 2002; updated April 2005

Question

I have a dataset, and I wish to work with a subset of observations, and that subset is defined by a complicated criterion. (This might be a long list of identifiers or some other codes specifying which observations belong in the subset.) What is the easiest way to do this?

Answer

Before starting to answer, let us indicate just two situations in which this question might arise. You might wish to work with a smaller dataset, either that is defined by this criterion, or that is its complement. Or, you might wish to define a dummy variable (say, indicating firms subject to some particular regulation, which fall only into certain industries). Our example code focuses on the first, but, essentially, the same basic idea applies to the second.

1 The obvious but tedious way

You already know one solution: using a complicated if condition. It is just that you really would rather not type out some long line like

        . keep if id == 12 | id == 23 | id == 34 | id == 45 | and so on, and so on

In practice, what you type need never be as long as that implied by this example. You can cut down typing substantially by using functions such as inlist() and inrange().

(We will mention just once that it may be easier to use drop on the complementary subset, rather than using keep on the subset you want, a point that applies throughout this FAQ.)

2 A less general, but sometimes useful, way

Part of the difficulty in the first approach is repeated typing of various syntax elements. Questions like this arise frequently, so we need other methods. There is another way to approach selection whenever the criterion is equality with any of several integer values.

        . egen OK = anymatch(id), values(12 23 34 45 and so on)
        . keep if OK

The first statement uses the egen command. Read this as generate the new variable OK that is 1 (true) if id is equal to any of the values specified, and 0 otherwise. anymatch() in Stata 9 and later releases is a replacement for eqany() in Stata 8 and prior releases. A key detail is that the argument of values() may be a numlist, and so, to give only one example, unbroken sequences of integers may be specified concisely. This function is similar to being able to use inlist() or inrange() with if, as mentioned above.

3 Another way, using merge

A third way uses merge. The small mental block that you may need to get around is any fixed idea that merge is a command that produces larger datasets; i.e., "this dataset" plus "that dataset". It does precisely that combination, but in the problems discussed here, the useful product is the intersection, not the union.

To make matters concrete, let us suppose that main.dta contains observations and an identifier variable id and that we wish to select observations for some of those identifiers.

  • save main.dta, sorted on id, if you have not already done so.
  • clear those data, and create a dataset in Stata containing only the identifiers you want, using the same variable name id, with the same variable type as in main.dta, and sorted on id.
  • Now type
            . merge id using main
    
  • Observations with values for _merge of 3 are those which you want; that is, they form the overlap or intersection of the two datasets. Hence,
            . keep if _merge == 3
            . drop _merge
    

This method is free of any limits imposed by restrictions on how long a command line (section 1) or an option argument (section 2) may be.

4 A shortcut with the above when identifiers are numeric

Suppose that you have numeric identifiers given by ranges like 1/2 34/56 678/901 or more generally specifiable as a numlist. Clearly, you would not want to type in a dataset containing all the individual identifiers. Here is an alternative:

        . clear                
        . numlist "1/2 34/56 678/901"
        . tokenize `r(numlist)' 
        . local N : word count `r(numlist)' 
        . set obs `N' 
        . gen id = . 
        . forval i = 1 / `N' { 
        .	qui replace id = ``i'' in `i'
        . }

In other words, the numlist command expands the abbreviated numlist into its individual elements 1 2 34 35 ... 900 901, tokenize puts those individual elements into local macros, and the other commands then puts them into values of the variable id. This shortcut should work for up to 1600 elements. See limits.

5 A related FAQ

See also the FAQ How do you efficiently define group characteristics in your data in order to create subsets? at http://www.stata.com/support/faqs/data/characteristics.html, which was written by Kit Baum. Kit provided helpful comments for this FAQ as well.

FAQs
What's new?
Statistics
Data management
Graphics
Programming Stata
Mata
Resources
Internet capabilities
Stata for Windows
Stata for Unix
Stata for Macintosh
Technical support
Resources & support
FAQs
Technical support
NetCourses
Short courses
Users Group meetings
Statalist
Links
Software updates
Software archives
Customer service
Manuals & supplements
Stata Journal
STB
Stata News
Stata Automation
Plugins

Site overview
Products
Resources & support
Company
Site index

© Copyright 1996–2008 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index