Title | Random samples from an existing dataset

Author | Nicholas J. Cox, Durham University, UK

I have a dataset, and I wish to take one or more random subsamples. How can I do this in Stata?

First, we go through the solution when that is the whole problem. Then we indicate solutions when sampling should be done within each of a set of categories.

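There are two overarching questions: whether you want to sample with or without replacement, and whether your observations have identifiers.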

If the sample is to be taken with replacement, then each observation from
the dataset may appear in the sample not at all, once, or more than once.
What you want may be some bootstrap or similar resampling command, for
example, **bootstrap** or **bsample**.

If the sample is to be taken without replacement, then each observation from the dataset may appear in the sample not at all or once. The rest of this FAQ is based on the assumption that you are sampling without replacement and that the number of observations in memory is large enough for you to choose one or more samples of the size specified.

The Stata command **sample** codifies one approach to choosing a sample
without replacement. The concern here is with explaining enough basic ideas
that you can produce your own random samples as desired in Stata with a
combination of elementary Stata commands.
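To make the distinction concrete, here is a minimal sketch that uses the auto dataset shipped with Stata. The seed and the sample size of 20 are arbitrary choices for illustration.

```stata
sysuse auto, clear
set seed 12345

* with replacement: the same car may be drawn more than once
preserve
bsample 20
duplicates report make
restore

* without replacement: each car appears at most once
preserve
sample 20, count
isid make
restore
```

After **bsample**, **duplicates report** will typically show some values of **make** repeated. After **sample**, **isid make** should pass silently because no observation can be drawn twice.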

The second question is whether your observations have identifiers. Identifiers will usually be integer or string codes that specify a particular observation, such as the record for a particular person or organization in your dataset. Identifiers are not essential for all problems, but you will need them, for example, whenever you are choosing some observations from a large dataset for further study.
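As a rough sketch of how identifiers might be checked or created before any shuffling, consider the following. It assumes the auto dataset shipped with Stata, and **id** is a hypothetical variable name chosen for illustration.

```stata
sysuse auto, clear

* isid exits with an error unless make identifies observations uniquely
isid make

* if no identifier exists, record the original order in one
* (id is an illustrative name, not part of the dataset)
generate long id = _n
isid id
```

Creating the identifier before sorting on random numbers means that the original order can be restored later with **sort id**.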

Do you want to use the whole dataset, or just part? If you want to use just part, set up an indicator variable that flags the part you want to use; for example,

. generate OK = (gender == 1)

or

. generate OK = age >= 40 & age < .

The condition on the right-hand side of the assignment will be true for the
observations to be used. The resulting indicator variable **OK** will be
1 if and only if observations satisfy the stated condition (and 0
otherwise). Care may be needed if missing data are present: for numeric
variables, missing values count as higher than any other numeric value, so
**age >= 40** would include missing ages. Hence, we should stipulate
the extra condition **age < .** in generating the indicator variable.
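The effect of missing values can be seen in a small sketch with made-up ages. The variable name **naive** is introduced here only for comparison.

```stata
clear
input age
35
40
52
.
end

* without the extra condition, the missing age is flagged too
generate byte naive = age >= 40

* with the extra condition, only nonmissing ages of 40 or more are flagged
generate byte OK = age >= 40 & age < .

list
```

In the listing, the observation with missing **age** has **naive** equal to 1 but **OK** equal to 0. Writing **!missing(age)** in place of **age < .** is an equivalent way to state the extra condition.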

Next type

. generate random = runiform()

This command generates a set of pseudorandom numbers from a uniform
distribution on [0,1). If you want to document your results, or if you care
about precise reproducibility of results, then set the seed explicitly with
**set seed** before generating the numbers. See **generate** and
**functions**.
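Here is a short sketch of what setting the seed buys you, using the auto dataset shipped with Stata. The seed value and the variable name **again** are arbitrary choices for illustration.

```stata
sysuse auto, clear

* the same seed reproduces the same sequence of pseudorandom numbers
set seed 20240501
generate double random = runiform()

set seed 20240501
generate double again = runiform()

assert random == again
summarize random
```

Storing the numbers as **double** rather than the default **float** is a design choice, not a requirement: it makes tied values, and hence an arbitrary sort order among ties, less likely.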

You can shuffle the observations in memory by sorting on the random numbers just generated:

. sort random

You are now in a position to choose a random sample. For example, suppose that you want a sample of size 100. This could be the first 100 or the last 100 observations; for example,

. generate insample = _n <= 100

or

. generate insample = (_N - _n) < 100
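The steps so far can be put together in one sketch. The data are synthetic (1,000 empty observations), and **id** is a hypothetical identifier created for illustration.

```stata
clear
set obs 1000
generate long id = _n

set seed 2718
generate double random = runiform()

* shuffle, then flag the first 100 observations
sort random
generate byte insample = _n <= 100
count if insample

* restore the original order; the flag travels with each observation
sort id
list id in 1/10 if insample
```

**count** reports 100 selected observations. After **sort id**, the selected observations are scattered through the dataset rather than sitting in one block.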

If sampling is only from within observations that are **OK**, as above,
then

. sort OK random
. generate insample = OK & (_N - _n) < 100
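A sketch of why this works, using synthetic data in which every second observation is eligible (a made-up rule for illustration): sorting on **OK** first places the eligible observations last, so the final 100 observations are all eligible as long as at least 100 are available.

```stata
clear
set obs 1000
set seed 31415

* hypothetical eligibility rule: even-numbered observations only
generate byte OK = mod(_n, 2) == 0
generate double random = runiform()

* check that enough eligible observations exist
count if OK

sort OK random
generate byte insample = OK & (_N - _n) < 100

count if insample
assert OK if insample
```

If fewer than 100 observations were eligible, the condition **OK &** would stop ineligible observations from being selected, but the sample would then be smaller than intended, which is why the count is checked first.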

If you want two or more distinct random subsamples, you can extend this approach. For example, if you want two groups of equal size,

. sort random
. generate group = ceil(2 * _n/_N)

creates a variable with two categories, 1 and 2, with approximately equal
numbers in each category. These values indicate subsamples 1 and 2.
Clearly, if the number of observations is even, this can be done exactly,
whereas if it is odd, the two subsamples will differ in size by 1. In the
data, a block of observations for which **group** is 1 is followed by a
block for which **group** is 2. To see this, type

. tabulate group
. list group
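The odd case can be checked with a small synthetic sketch of 11 observations. The seed is arbitrary.

```stata
clear
set obs 11
set seed 99
generate double random = runiform()

sort random
generate byte group = ceil(2 * _n/_N)

* observations 1 to 5 get group 1; observations 6 to 11 get group 2
tabulate group
```

Here **2 * _n/_N** is at most 10/11 for the first five observations and exceeds 1 from the sixth onward, so **ceil()** yields subsamples of 5 and 6.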

For three or more groups, use the number required in place of **2**, with
similar results. For the more complicated case in which, to be selected,
sample observations must also be **OK**, type

. sort OK random
. by OK: generate group = ceil(2 * _n/_N) if OK

and so forth.
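A sketch of the restricted case with synthetic data follows. The rule that the last 60 of 100 observations are eligible is made up for illustration.

```stata
clear
set obs 100
set seed 4242

* hypothetical eligibility rule: 60 of the 100 observations are eligible
generate byte OK = _n > 40
generate double random = runiform()

sort OK random
by OK: generate byte group = ceil(2 * _n/_N) if OK

* group is missing for the 40 observations that are not OK
tabulate group OK, missing
```

Under **by OK:**, **_n** and **_N** refer to the position and the number of observations within each category of **OK**, so the 60 eligible observations split into two groups of 30.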

If you want groups of unequal size, then usually the definition will use the
observation number **_n**. If you had 1,000 observations and you wanted
to subdivide into subsamples of 200 and 800, then the statement could be

. gen group = 1 + (_n > 200)

**_n > 200** is false (0) for the first 200 observations and true (1)
for the rest. Adding 1 produces a categorical variable with values 1 and 2.
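The same device extends to more than two subsamples of unequal size. The sketch below uses synthetic data, and the three-way split into 200, 300, and 500 is an illustrative extension, not part of the example above.

```stata
clear
set obs 1000
set seed 777
generate double random = runiform()
sort random

* two subsamples of 200 and 800
generate byte group = 1 + (_n > 200)
tabulate group

* three subsamples of 200, 300, and 500: each true condition adds 1
generate byte group3 = 1 + (_n > 200) + (_n > 500)
tabulate group3
```

The observations must be shuffled first, as here with **sort random**. Otherwise, the subsamples would simply follow the existing order of the data.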

The most complicated kind of problem considered here is illustrated by this example:

- We want to work with people for whom **age <= 40** (and is not missing).
- We have males and females and want to subdivide each gender into two equal groups.

Once again, we might need to be careful about **gender** being missing.

Putting all the ideas together, we get

. gen byte OK = age <= 40 & age < . & gender < .
. gen random = runiform()
. sort OK gender random
. by OK gender: gen group = ceil(2 * _n/_N) if OK

Several Stata ideas are being used here.

Sorting first on **OK** segregates all the observations that we do not
want to use. Sorting second on **gender** ensures later equal subdivision
within **gender**. Sorting last on **random** shuffles our data
within those categories. Finally, generating the categorical variable within
each **gender** group produces the random splitting.

As a result of these statements, there are two classifications of the data,
by **gender** and by **group**.

For more than two groups, replace **2** in the last statement with the
number required.
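To try these statements without real data, here is a sketch that first builds a synthetic dataset. The age range, the gender codes 1 and 2, and the positions of the missing values are all made up for illustration.

```stata
clear
set obs 200
set seed 1234

* hypothetical data: ages 20 to 59 and gender codes 1 and 2
generate age = 20 + floor(40 * runiform())
generate byte gender = 1 + (runiform() < 0.5)
replace age = . in 1/5
replace gender = . in 6/10

generate byte OK = age <= 40 & age < . & gender < .
generate double random = runiform()
sort OK gender random
by OK gender: generate byte group = ceil(2 * _n/_N) if OK

* within each gender, the two groups differ in size by at most 1
tabulate gender group
```

**tabulate** omits observations with missing **group**, so the table shows only the observations that are **OK**.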

In all these examples, Stata commands have produced variables that identify
the observations in each subsample. Typically the next step is to carry out
computations for such subsamples. For example, computations for the sample
defined by the variable **insample** will specify **if insample == 1**
or, more concisely, **if insample**. Similarly, computations for the two
or more samples defined by the variable **group** may use qualifiers such as
**if group == 1** or **if group == 2**.
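A brief sketch of such computations follows, using the auto dataset shipped with Stata. The sample size of 30 and the choice of **mpg** and **weight** are arbitrary.

```stata
sysuse auto, clear
set seed 2024
generate double random = runiform()
sort random

generate byte insample = _n <= 30
generate byte group = ceil(2 * _n/_N)

* one random sample of 30 cars
summarize mpg if insample

* the same regression in each of two random halves
regress mpg weight if group == 1
regress mpg weight if group == 2
```

When the same command is to be repeated for every subsample, a **by** prefix such as **bysort group: summarize mpg** is a compact alternative to a series of **if** qualifiers.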