Title | Random samples from an existing dataset
Author | Nicholas J. Cox, Durham University, UK
I have a dataset, and I wish to take one or more random subsamples. How can I do this in Stata?
First, we go through the solution when that is the whole problem. Then we indicate solutions when sampling should be done within each of a set of categories.
There are two overarching questions:
If the sample is to be taken with replacement, then each observation from the dataset may appear in the sample not at all, once, or more than once. What you want may be some bootstrap or similar resampling command, for example, bootstrap or bsample.
If the sample is to be taken without replacement, then each observation from the dataset may appear in the sample not at all or once. The rest of this FAQ is based on the assumption that you are sampling without replacement and that the number of observations in memory is large enough for you to choose one or more samples of the size specified.
The Stata command sample codifies one approach to choosing a sample without replacement. The concern here is with explaining enough basic ideas that you can produce your own random samples as desired in Stata with a combination of elementary Stata commands.
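For comparison, here is a minimal sketch of the built-in commands mentioned above. It assumes the auto dataset shipped with Stata as a stand-in for your own data; the seed and the sample size of 20 are arbitrary choices for illustration.

```stata
* Illustration only: auto.dta stands in for your own dataset
sysuse auto, clear
* Arbitrary seed, set only so that the run can be repeated
set seed 2803

* Without replacement: sample keeps 20 randomly chosen observations
* and drops the rest, so preserve the data first
preserve
sample 20, count
count
restore

* With replacement: bsample replaces the data in memory with a
* bootstrap sample of the same size, in which observations may repeat
preserve
bsample
count
restore
```

Because both commands change the data in memory, the preserve and restore pair brings the original dataset back after each draw.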
Do your observations have identifiers? Identifiers will usually be integer or string codes that specify a particular observation, such as the record for a particular person or organization in your dataset. Identifiers are not essential for all problems, but you will need them, for example, whenever you are choosing some observations from a large dataset for further study.
Do you want to use the whole dataset, or just part? If you want to use just part, set up an indicator variable that flags the part you want to use; for example,
. generate OK = (gender == 1)
or
. generate OK = age >= 40 & age < .
The condition on the right-hand side of the assignment will be true for the observations to be used. The resulting indicator variable OK will be 1 if and only if observations satisfy the stated condition (and 0 otherwise). Care may be needed if missing data are present: for numeric variables, missing values count as higher than any other numeric value, so age >= 40 would include missing ages. Hence, we should stipulate the extra condition age < . in generating the indicator variable.
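The effect of missing values on such a condition can be checked directly. The sketch below assumes the auto dataset shipped with Stata, in which the variable rep78 has some missing values; it stands in for age in the discussion above.

```stata
* Illustration only: rep78 in auto.dta plays the role of age
sysuse auto, clear

* This count includes observations with rep78 missing,
* because missing counts as higher than any other numeric value
count if rep78 >= 4

* These two counts exclude the missing values and agree with each other
count if rep78 >= 4 & rep78 < .
count if rep78 >= 4 & !missing(rep78)

* The indicator is 1 only for nonmissing values of 4 or more
generate byte OK = rep78 >= 4 & rep78 < .
tabulate OK, missing
```

Writing !missing(rep78) is an equivalent alternative to rep78 < . that some readers find easier to read.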
Next type
. generate random = runiform()
This command generates a set of pseudorandom numbers from a uniform distribution on [0,1). If you want to document your results, or if you care about precise reproducibility of results, set the seed explicitly with set seed before generating the random numbers. See generate and functions.
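A short sketch of what setting the seed buys you, assuming the auto dataset shipped with Stata and an arbitrary seed of 12345: resetting the same seed reproduces exactly the same pseudorandom numbers.

```stata
* Illustration only: any dataset in memory would do
sysuse auto, clear

* Arbitrary seed; record whatever value you use
set seed 12345
generate random1 = runiform()

* Resetting the same seed restarts the same sequence
set seed 12345
generate random2 = runiform()

* The two variables are identical in every observation
assert random1 == random2
```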
You can shuffle the observations in memory by sorting on the random numbers just generated:
. sort random
You are now in a position to choose a random sample. For example, suppose that you want a sample of size 100. This could be the first 100 or the last 100 observations; for example,
. generate insample = _n <= 100
or
. generate insample = (_N - _n) < 100
If sampling is only from within observations that are OK, as above, then
. sort OK random
. generate insample = OK & (_N - _n) < 100
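The steps so far can be put together in one short run. This is a sketch under stated assumptions: the auto dataset shipped with Stata stands in for your data, domestic cars play the role of the observations that are OK, and the sample size is 20 rather than 100 because the dataset is small.

```stata
* Illustration only: auto.dta stands in for your own dataset
sysuse auto, clear
set seed 12345

* Flag the part of the data to sample from (domestic cars here)
generate byte OK = foreign == 0

* Shuffle: observations that are OK sort to the end, in random order
generate double random = runiform()
sort OK random

* Select the last 20 observations, all of which are OK
generate byte insample = OK & (_N - _n) < 20

* Check: every sampled observation is OK
tabulate OK insample
```

Storing random as a double is a design choice that makes tied random numbers less likely in large datasets.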
If you want two or more distinct random subsamples, you can extend this approach. For example, if you want two groups of equal size,
. sort random
. generate group = ceil(2 * _n/_N)
creates a variable with two categories, 1 and 2, with approximately equal numbers in each category. These values indicate subsamples 1 and 2. Clearly, if the number of observations is even, this can be done exactly, whereas if it is odd, the two subsamples will differ in size by 1. In the data, a block of observations for which group is 1 is followed by a block for which group is 2. To see this, type
. tabulate group
. list group
For three or more groups, use the number required in place of 2, with similar results. For the more complicated case in which observations must also be OK to be selected, type
. sort OK random
. by OK: generate group = ceil(2 * _n/_N) if OK
and so forth.
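As one instance of "and so forth", the sketch below splits only the observations that are OK into three groups. It assumes the auto dataset shipped with Stata, with a hypothetical selection rule (price below 10,000) chosen purely for illustration.

```stata
* Illustration only: auto.dta and the price rule are stand-ins
sysuse auto, clear
set seed 12345

* Hypothetical rule for which observations may be used
generate byte OK = price < 10000

generate double random = runiform()
sort OK random

* Three groups of nearly equal size among the OK observations;
* group is missing for observations that are not OK
by OK: generate byte group = ceil(3 * _n/_N) if OK

tabulate group OK, missing
```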
If you want groups of unequal size, then usually the definition will use the observation number _n. If you had 1,000 observations and you wanted to subdivide into subsamples of 200 and 800, then the statement could be
. gen group = 1 + (_n > 200)
_n > 200 is false (0) for the first 200 observations and true (1) for the rest. Adding 1 produces a categorical variable with values 1 and 2.
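The same device extends to more than two unequal groups by adding further conditions. A minimal sketch, assuming 1,000 artificial observations and hypothetical group sizes of 200, 300, and 500:

```stata
* Illustration only: an artificial dataset of 1,000 observations
clear
set obs 1000
set seed 12345

generate double random = runiform()
sort random

* Each true condition adds 1: observations 1-200 get 1,
* observations 201-500 get 2, and observations 501-1000 get 3
generate byte group = 1 + (_n > 200) + (_n > 500)

tabulate group
```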
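The most complicated kind of problem considered here combines all the ideas above: use only the observations that are OK, and split them at random into two groups of equal size separately within each gender.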
Once again, we might need to be careful about gender being missing.
Putting all the ideas together, we get
. gen byte OK = age >= 40 & age < . & gender < .
. gen random = runiform()
. sort OK gender random
. by OK gender: gen group = ceil(2 * _n/_N) if OK
Several Stata ideas are being used here.
Sorting first on OK segregates all the observations that we do not want to use. Sorting second on gender ensures later equal subdivision within gender. Sorting last on random shuffles our data within those categories. Finally, generating the categorical variable within each gender group produces the random splitting.
As a result of these statements, there are two classifications of the data, by gender and by group.
For more than two groups, replace 2 in the last statement with the number required.
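To see the whole procedure run, here is a sketch on artificial data. Everything about the data is an assumption made for illustration: 500 observations, ages between 18 and 77, gender coded 1 or 2, a few values set to missing, and eligibility defined as age 40 or over, as in the earlier indicator example.

```stata
* Illustration only: artificial data with age and gender
clear
set obs 500
set seed 12345
generate age = floor(18 + 60 * runiform())
generate byte gender = 1 + (runiform() < 0.5)

* Introduce some missing values to show that they are excluded
replace age = . in 1/10
replace gender = . in 11/20

* Eligible: aged 40 or over, with age and gender both known
generate byte OK = age >= 40 & age < . & gender < .

generate double random = runiform()
sort OK gender random
by OK gender: generate byte group = ceil(2 * _n/_N) if OK

* Within each gender, the two groups differ in size by at most 1
tabulate gender group if OK
```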
In all these examples, Stata commands have produced variables that identify the observations in each subsample. Typically the next step is to carry out computations for such subsamples. For example, computations for the sample defined by the variable insample will specify if insample == 1 or, more concisely, if insample. Similarly, computations for the two or more samples defined by the variable group may use qualifiers such as if group == 1 or if group == 2.
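A brief sketch of that next step, assuming the auto dataset shipped with Stata and arbitrary choices of sample size, variables, and commands:

```stata
* Illustration only: auto.dta stands in for your own dataset
sysuse auto, clear
set seed 12345
generate double random = runiform()
sort random

* One sample of 30, and a split of the whole dataset into two groups
generate byte insample = _n <= 30
generate byte group = ceil(2 * _n/_N)

* Computations restricted to the sample
summarize price if insample

* The same model fitted separately to each group
regress price weight if group == 1
regress price weight if group == 2

* Or repeat a command for every group in turn
bysort group: summarize price
```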