The following material is based on postings to
Statalist.
How do I select a subset of observations using a complicated criterion?
|
Title
|
|
Selecting a subset of observations with a complicated criterion
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
July 2002; updated July 2011
|
Question
I have a dataset, and I wish to work with a subset of observations, and that
subset is defined by a complicated criterion. (This might be a long list of
identifiers or some other codes specifying which observations belong in the
subset.) What is the easiest way to do this?
Answer
Before starting to answer, let us indicate just two situations in which this
question might arise. You might wish to work with a smaller dataset
that is either defined by this criterion or that is its complement. Or, you might
wish to define a dummy variable (say, indicating firms subject to some
particular regulation, which fall only into certain industries). Our example
code focuses on the first, but, essentially, the same basic idea applies to
the second.
1 The obvious but tedious way
You already know one solution: using a complicated
if
condition. It is just that you really would rather not type out some long
line like
. keep if id == 12 | id == 23 | id == 34 | id == 45 | and so on, and so on
In practice, what you type should never be as long as this
example implies. You can cut down typing substantially by using functions such as
inlist() and
inrange().
(We will mention just once that it may be easier to use
drop
on the complementary subset, rather than using
keep
on the subset you want, a point that applies throughout this FAQ.)
2 A less general, but sometimes useful, way
Repeated typing of various
syntax elements is part of what makes this approach difficult.
Questions like this arise frequently, so we need other
methods. There is another way to approach selection whenever
equality with any of several integer values is the criterion.
. egen OK = anymatch(id), values(12 23 34 45 and so on)
. keep if OK
The first statement uses the
egen
command. Read this as generate the new variable OK that is 1 (true)
if id is equal to any of the values specified and 0 otherwise.
anymatch() in Stata 9 and later releases is a replacement for
eqany() in Stata 8 and prior releases. Crucially, the
argument of values() may be a
numlist, so,
to give only one example, unbroken sequences of integers may be specified
concisely. This function is similar to using inlist() or
inrange() with if, as mentioned above.
3 Another way, using merge
A third way uses merge.
You may need to get around a mental block that
merge is a command that produces larger datasets; i.e., "this
dataset" plus "that dataset". It does precisely that, but, in the
problems discussed here, the useful product is the intersection, not the
union.
To make matters concrete, let us suppose that main.dta contains
observations and an identifier variable id, and we wish to select
observations for some of those identifiers.
This method is free of any limits imposed by restrictions on how long a
command line (section 1) or an option argument (section 2) may be.
4 A shortcut with the above when identifiers are numeric
Suppose you have numeric identifiers given by ranges like 1/2 34/56
678/901 or, more generally specifiable, as a
numlist. Clearly,
you would not want to type in a dataset containing all the individual
identifiers. Here is an alternative:
. clear
. numlist "1/2 34/56 678/901"
. tokenize `r(numlist)'
. local N : word count `r(numlist)'
. set obs `N'
. gen id = .
. forval i = 1 / `N' {
. qui replace id = ``i'' in `i'
. }
In other words, the numlist command expands the abbreviated
numlist into its individual elements 1 2 34 35 ... 900 901,
tokenize puts
those individual elements into local macros, and the other commands then
put them into values of the variable id. This shortcut should work
for up to 1600 elements. See
limits.
5 A related FAQ
See also the FAQ How do you efficiently define group characteristics in
your data in order to create subsets? at
http://www.stata.com/support/faqs/data-management/group-characteristics-for-subsets/, which was
written by Kit Baum. Kit provided helpful comments for this FAQ as well.
|