How can I sample clusters, not individuals?
Sampling clusters, not individuals
Nicholas J. Cox, Durham University, UK
Scott Merryman, Risk Management Agency/USDA
April 2006; updated July 2011
Often you need to sample clusters, not individuals. Suppose you have a
dataset with individual people from several households, but you wish to
sample households randomly, not individuals. Here are two ways to do so.
Much of what we do here is also feasible through sample2 (Weesie 1997).
In each selection, clusters are chosen on random numbers produced by using
Stata's random-number generator.
If you are serious about replicating your research, you will need to set
the random-number seed (see help set seed)
before generating the random numbers.
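As a minimal sketch of why the seed matters, the lines below draw the same random numbers twice from the same seed. The seed value and the variable names u1 and u2 are arbitrary choices made for this illustration only.

```stata
* Toy dataset of 5 observations (illustration only)
clear
set obs 5

* Set the seed, then draw random numbers
set seed 20110701
generate double u1 = runiform()

* Resetting the same seed reproduces exactly the same draws
set seed 20110701
generate double u2 = runiform()

assert u1 == u2
```

In real work you would set the seed once, near the top of the do-file, before any command that uses random numbers.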
One way to accomplish our goal would be to keep
one observation from each household, randomly sample
from the remaining observations, and then merge
back to the original dataset.
For example, assume a household identifier hhid:
tempfile tmp
preserve
bysort hhid: keep if _n == 1
sample 10
save `tmp'
restore
merge m:1 hhid using `tmp'
keep if _merge == 3
We preserve the dataset before keeping just one observation from each household,
and then use sample to select an approximate 10%
sample. We save
this sample to a temporary file. We then restore the
original dataset and merge with the saved dataset. The part of the
dataset that we want is indicated by _merge == 3. With the
merge m:1 syntax, merge sorts both datasets on hhid as needed; only the
older merge syntax required a previous sort of both datasets.
If you want to take a sample that is not a particular percentage of the
dataset but rather has an exact sample size, use sample, count.
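The following sketch applies sample, count to clusters, drawing exactly 10 households. It assumes a toy dataset of 100 households with 3 members each; the dataset, the seed value, and the variable name first are made up for this illustration. It also uses the keep(match) and nogenerate options of merge as a shorter alternative to keep if _merge == 3.

```stata
* Toy data: 100 households with 3 members each (illustration only)
clear
set obs 300
generate long hhid = ceil(_n/3)
set seed 20110701

tempfile tmp
preserve
* One row per household
bysort hhid: keep if _n == 1
* Draw exactly 10 households rather than a percentage
sample 10, count
keep hhid
save `tmp'
restore

* Keep only members of the sampled households
merge m:1 hhid using `tmp', keep(match) nogenerate

* Check: 10 distinct households, all 3 members of each retained
egen byte first = tag(hhid)
count if first
assert r(N) == 10
assert _N == 30
```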
Look, no file choreography
We now show another solution, all in place with no pas de deux of
dancing files and without using sample. Knowing how to do
it using basic principles may appeal to you.
First, keep a copy of the sort order:
gen long order = _n
Then select one observation from each household:
egen select = tag(hhid)
Now produce some random numbers and sort:
gen rnd = runiform()
sort select rnd
One observation per household has now been sorted to the end, and those
observations have been shuffled on the fly, courtesy of the random numbers.
Suppose you want 10 of 100 households:
replace select = _n > (_N - 10)
The indicator select is now 1 for the last 10 observations
and 0 otherwise. Now we spread the word of being selected
among the household members:
bysort hhid (select): replace select = select[_N]
Finally, go back to the original sort order, and clean up:
sort order
drop order rnd
This variation keeps both the selected sample, for which select ==
1, and the other observations, for which select == 0. If you
want only the sample observations, then type
drop if !select or keep if select.
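To check the in-place approach end to end, one possibility is to run it on toy data and assert the properties you expect. The sketch below assumes 100 households of 3 members each; the dataset, the seed value, and the variable name first are hypothetical and serve only this illustration.

```stata
* Toy data: 100 households with 3 members each (illustration only)
clear
set obs 300
generate long hhid = ceil(_n/3)
set seed 20110701

* The in-place selection of 10 households
generate long order = _n
egen select = tag(hhid)
generate rnd = runiform()
sort select rnd
replace select = _n > (_N - 10)
bysort hhid (select): replace select = select[_N]
sort order
drop order rnd

* Check 1: select is constant within each household
bysort hhid (select): assert select[1] == select[_N]

* Check 2: exactly 10 households, hence 30 people, are selected
egen byte first = tag(hhid)
count if first & select
assert r(N) == 10
count if select
assert r(N) == 30
```

Because whole households are flagged, the number of selected individuals depends on household sizes; it is 30 here only because every toy household has 3 members.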
To learn more
In addition to the usual online help or manual entries, see
for a discussion of sampling individuals.
- Weesie, J. 1997.
dm46: Enhancement to the sample command.
Stata Technical Bulletin 37: 6–7. Reprinted in
Stata Technical Bulletin Reprints, vol. 7, pp. 37–38.