How can I take random samples from an existing dataset?
Title: Random samples from an existing dataset
Author: Nicholas J. Cox, Durham University, UK
Date: December 2000; updated July 2005; minor revisions July 2009
Question:
I have a dataset, and I wish to take one or more random subsamples. How can
I do this in Stata?
Answer:
First, we go through the solution when that is the whole problem. Then we
indicate solutions when sampling should be done within each of a set of
categories.
There are two overarching questions:
1. Is the sample to be taken with or without replacement?
If the sample is to be taken with replacement, then each observation from
the dataset may appear in the sample not at all, once, or more than once.
What you want may be some bootstrap or similar resampling command, for
example, bootstrap or bsample.
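As a minimal sketch of sampling with replacement, the lines below use bsample on the auto dataset shipped with Stata (74 observations). The dataset, the seed, and the sample size of 50 are arbitrary choices made for illustration. Note that bsample replaces the data in memory with the resampled observations.

```stata
sysuse auto, clear
* arbitrary seed, so that the draw can be reproduced
set seed 2009
* draw 50 observations with replacement from the 74 in memory
bsample 50
* some cars will typically appear more than once
duplicates report make
```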
If the sample is to be taken without replacement, then each observation from
the dataset may appear in the sample not at all or once. The rest of this
FAQ is based on the assumption that you are sampling without replacement and
that the number of observations in memory is large enough for you to choose
one or more samples of the size specified.
The Stata command sample codifies one approach to choosing a sample without
replacement. The concern here is with explaining enough basic ideas that you
can produce your own random samples as desired in Stata with a combination of
elementary Stata commands.
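For comparison, here is a short sketch of the sample command itself, again assuming the auto dataset shipped with Stata and an arbitrary seed. Without the count option, the number given is read as a percentage of observations to keep.

```stata
sysuse auto, clear
* arbitrary seed, so that the draw can be reproduced
set seed 2009
* keep 20 randomly chosen observations and drop the rest
sample 20, count
```

Because sample drops the observations not selected, it suits cases in which only the sample is needed afterward.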
2. Are observations already labeled by unique identifiers?
These identifiers will usually be integer or string codes that specify a
particular observation, such as the record for a particular person or
organization in your dataset. Identifiers are not essential for all
problems, but you will need them, for example, whenever you are choosing
some observations from a large dataset for further study.
Subsetting the dataset, with no extra subdivision by categories
Do you want to use the whole dataset, or just part? If you want to use just
part, set up an indicator variable that flags the part you want to use; for
example,
. generate OK = (gender == 1)
or
. generate OK = age >= 40 & age < .
The condition on the right-hand side of the assignment will be true for the
observations to be used. The resulting indicator variable OK will be
1 if and only if observations satisfy the stated condition (and 0
otherwise). Care may be needed if missing data are present: for numeric
variables, missing values count as higher than any other numeric value, so
age >= 40 would include missing ages. Hence, we should stipulate
the extra condition age < . in generating the indicator variable.
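The effect of missing values can be checked on a tiny made-up dataset. The four ages below are hypothetical values chosen only to show the difference between the two conditions.

```stata
clear
input age
25
40
67
.
end
* unguarded condition: the missing age also counts as >= 40, giving 3
count if age >= 40
* guarded condition: only the nonmissing ages 40 and 67 qualify, giving 2
count if age >= 40 & age < .
```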
Next type
. generate random = runiform()
This command generates a set of pseudorandom numbers from a uniform
distribution on [0,1). If you want to document your results, or if you care
about precise reproducibility of results, set the seed explicitly with
set seed before generating the random numbers. See generate and functions.
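A brief sketch of why setting the seed matters: resetting the same seed reproduces the same sequence of pseudorandom numbers. The seed value 12345 and the auto dataset are arbitrary choices for illustration. Storing the numbers as double is an extra choice made here, not part of the original command; it makes tied values less likely when the variable is later used for sorting.

```stata
sysuse auto, clear
set seed 12345
generate double random = runiform()
* resetting the same seed reproduces exactly the same sequence
set seed 12345
generate double random2 = runiform()
* the assertion passes silently when the two variables agree everywhere
assert random == random2
```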
Shuffle your data randomly, and subdivide into groups
You can shuffle the observations in memory by sorting on the random numbers
just generated:
. sort random
You are now in a position to choose a random sample. For example, suppose
that you want a sample of size 100. This could be the first 100 or the last
100 observations; for example,
. generate insample = _n <= 100
or
. generate insample = (_N - _n) < 100
If sampling is only from within observations that are OK, as above,
then
. sort OK random
. generate insample = OK & (_N - _n) < 100
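Putting these steps together, the following sketch draws 20 cars at random from the domestic cars only in the auto dataset shipped with Stata. The dataset, the seed, the definition of OK, and the sample size are all assumptions made for the example.

```stata
sysuse auto, clear
set seed 12345
* flag domestic cars only (52 of the 74 observations)
generate byte OK = (foreign == 0)
generate double random = runiform()
* observations with OK == 1 sort to the end, in random order
sort OK random
* the last 20 observations are therefore a random sample of the OK cars
generate byte insample = OK & (_N - _n) < 20
count if insample
```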
If you want two or more distinct random subsamples, you can extend this
approach. For example, if you want two groups of equal size, then
. sort random
. generate group = group(2)
creates a variable with two categories, 1 and 2, with approximately equal
numbers in each category. These values indicate subsamples 1 and 2.
Clearly, if the number of observations is even, this can be done exactly,
whereas if it is odd, the two subsamples will differ in size by 1. In the
data, a block of observations for which group is 1 is followed by a
block for which group is 2. To see this, type
. tabulate group
. list group
For three or more groups, use the number required in place of 2, with
similar results. For the more complicated case in which observations must
also be OK to be selected, type
. sort OK random
. by OK: generate group = group(2) if OK
and so forth.
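In case group() is not recognized by your version of Stata, the same split can be written with elementary functions. The sketch below is a substitute for the code above, not the code above itself: after a random sort, ceil(2 * _n/_N) is 1 for the first half of the observations and 2 for the rest. The auto dataset (74 observations) and the seed are arbitrary choices.

```stata
sysuse auto, clear
set seed 12345
generate double random = runiform()
sort random
* observations in the first half get 1, the remainder get 2
generate byte group = ceil(2 * _n/_N)
tabulate group
```

With 74 observations each group has 37 members. For k groups, replace 2 with k. Within OK observations only, the same expression can follow by OK: after sort OK random.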
If you want groups of unequal size, then usually the definition will use the
observation number _n. If you had 1,000 observations and you wanted
to subdivide into subsamples of 200 and 800, then the statement could be
. gen group = 1 + (_n > 200)
_n > 200 is false (0) for the first 200 observations and true (1)
for the rest. Adding 1 produces a categorical variable with values 1 and 2.
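The same idea extends to more than two groups by adding further comparisons. As a sketch on an artificial dataset of 1,000 observations, the lines below produce three subsamples of sizes 200, 300, and 500. The cut points and the seed are arbitrary choices for illustration.

```stata
clear
set obs 1000
set seed 12345
generate double random = runiform()
sort random
* 1 for observations 1-200, 2 for 201-500, 3 for 501-1000
generate byte group = 1 + (_n > 200) + (_n > 500)
tabulate group
```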
Subsetting the dataset and within categories as well
The most complicated kind of problem considered here is illustrated by this
example:
- We want to work with people for whom age <= 40 (and is not missing).
- We have males and females and want to subdivide each gender into two equal groups.
Once again, we might need to be careful about gender being missing.
Putting all the ideas together, we get
. gen byte OK = age <= 40 & age < . & gender < .
. gen random = runiform()
. sort OK gender random
. by OK gender: gen group = group(2) if OK
Several Stata ideas are being used here.
Sorting first on OK segregates all the observations that we do not
want to use. Sorting second on gender ensures later equal subdivision
within gender. Sorting last on random shuffles our data
within those categories. Finally, generating the categorical variable within
each gender group produces the random splitting.
As a result of these statements, there are two classifications of the data,
by gender and by group.
For more than two groups, replace 2 in the last statement with the
number required.
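The whole sequence can be tried on artificial data. In the sketch below, the 200 observations, the age range, the coding of gender as 1 and 2, and the seed are all hypothetical, and ceil(2 * _n/_N) is used as a stand-in for group(2) so that the example relies only on elementary functions.

```stata
clear
set obs 200
set seed 12345
* hypothetical data: integer ages from 20 to 59, gender coded 1 or 2
generate age = 20 + floor(40 * runiform())
generate byte gender = 1 + (runiform() < 0.5)
generate byte OK = age <= 40 & age < . & gender < .
generate double random = runiform()
sort OK gender random
* split each gender into two halves, among OK observations only
by OK gender: generate byte group = ceil(2 * _n/_N) if OK
* cross-classify by gender and group; observations that are not OK are omitted
tabulate gender group
```

In the resulting table, the two group counts within each gender should differ by at most 1.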
In all these examples, Stata commands have produced variables that identify
the observations in each subsample. Typically the next step is to carry out
computations for such subsamples. For example, computations for the sample
defined by the variable insample will specify if insample == 1
or, more concisely, if insample. Similarly, computations for the two
or more samples defined by the variable group may use qualifiers such as
if group == 1 or if group == 2.
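As an illustration of such qualifiers, the sketch below summarizes price for one sample and for each of two subsamples of the auto dataset shipped with Stata. The dataset, the seed, the sample size of 30, and the choice of summarize are assumptions made for the example.

```stata
sysuse auto, clear
set seed 12345
generate double random = runiform()
sort random
* a sample of 30 observations and a split into two groups of 37
generate byte insample = _n <= 30
generate byte group = 1 + (_n > 37)
* statistics for the random sample only
summarize price if insample
* the same command for each of the two subsamples in turn
summarize price if group == 1
summarize price if group == 2
```

The concise form if insample is safe here because insample contains only 0 and 1. Any nonzero value, including missing, would count as true.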