Home  /  Resources & support  /  FAQs  /  Selecting a subset of observations with a complicated criterion
The following material is based on postings to Statalist.

How do I select a subset of observations using a complicated criterion?

Title   Selecting a subset of observations with a complicated criterion
Author Nicholas J. Cox, Durham University, UK

Question

I have a dataset, and I wish to work with a subset of observations, and that subset is defined by a complicated criterion. (This might be a long list of identifiers or some other codes specifying which observations belong in the subset.) What is the easiest way to do this?

Answer

Before starting to answer, let us indicate just two situations in which this question might arise. You might wish to work with a smaller dataset that is defined by either this criterion or its complement. Or, you might wish to define an indicator or dummy variable (say, indicating firms subject to some particular regulation, which fall only into certain industries). Our example code focuses on the first, but, essentially, the same basic idea applies to the second.

1 The obvious but tedious way

You already know one solution: using a complicated if condition. It is just that you really would rather not type out some long line like

        . keep if id == 12 | id == 23 | id == 34 | id == 45 | and so on, and so on

In practice, what you type should never be as long as this example implies. You can cut down typing substantially by using functions such as inlist() and inrange().

(We will mention just once that it may be easier to use drop on the complementary subset, rather than using keep on the subset you want, a point that applies throughout this FAQ.)

2 A less general, but sometimes useful, way

Repeated typing of various syntax elements is part of what makes this approach difficult. Questions like this arise frequently, so we need other methods. There is another way to approach selection whenever equality with any of several integer values is the criterion.

        . egen OK = anymatch(id), values(12 23 34 45 and so on)
        . keep if OK

The first statement uses the egen command. Read this as generate the new variable OK that is 1 (true) if id is equal to any of the values specified and 0 otherwise. anymatch() in Stata 9 and later releases is a replacement for eqany() in Stata 8 and prior releases. Crucially, the argument of values() may be a numlist, so, to give only one example, unbroken sequences of integers may be specified concisely. This function is similar to using inlist() or inrange() with if, as mentioned above.

3 Another way, using merge

A third way uses merge. You may need to get around a mental block that merge is a command that produces larger datasets; i.e., "this dataset" plus "that dataset". It does precisely that, but, in the problems discussed here, the useful product is the intersection, not the union.

To make matters concrete, let us suppose that main.dta contains observations and an identifier variable id, and we wish to select observations for some of those identifiers.

  • save main.dta, sorted on id, if you have not already done so.
  • clear those data, and create a dataset in Stata containing only the identifiers you want, using the same variable name id, with the same variable type as in main.dta, and sorted on id.
  • Now type
            . merge 1:m id using main
    
  • Observations with values for _merge of 3 are those which you want; that is, they form the overlap or intersection of the two datasets. Hence,
            . keep if _merge == 3
            . drop _merge
    

This method is free of any limits imposed by restrictions on how long a command line (section 1) or an option argument (section 2) may be.

4 A shortcut with the above when identifiers are numeric

Suppose you have numeric identifiers given by ranges like 1/2 34/56 678/901 or, more generally specifiable, as a numlist. Clearly, you would not want to type in a dataset containing all the individual identifiers. Here is an alternative:

        . clear                
        . numlist "1/2 34/56 678/901"
        . tokenize `r(numlist)' 
        . local N : word count `r(numlist)' 
        . set obs `N' 
        . gen id = . 
        . forval i = 1 / `N' { 
        .	qui replace id = ``i'' in `i'
        . }

In other words, the numlist command expands the abbreviated numlist into its individual elements 1 2 34 35 ... 900 901, tokenize puts those individual elements into local macros, and the other commands then put them into values of the variable id. This shortcut should work for up to 2,500 elements. See help limits.

5 A related FAQ

See also the FAQ How do you efficiently define group characteristics in your data in order to create subsets?, which was written by Kit Baum. Kit provided helpful comments for this FAQ as well.