How can I sample clusters, not individuals?
Title: Sampling clusters, not individuals
Authors: Nicholas J. Cox, Durham University, UK
         Scott Merryman, Risk Management Agency/USDA
Date: April 2006; updated July 2011
Introduction
Often you need to sample clusters, not individuals. Suppose you have a
dataset with individual people from several households, but you wish to
sample households randomly, not individuals. Here are two ways to do so.
Much of what we do here is also feasible through sample2
(Weesie 1997).
In both solutions, clusters are chosen according to random numbers produced
by runiform(). If you want your results to be reproducible, set the
random-number seed (see help set seed) before generating the random numbers.
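As a minimal sketch of why the seed matters, the lines below draw the same random numbers twice after resetting the seed. The seed value 20110701, the five-observation toy dataset, and the variable names u1 and u2 are arbitrary choices for illustration, not part of the original FAQ. Note that clear discards any data in memory.

```stata
* Toy dataset (hypothetical): 5 empty observations
clear
set obs 5

* Set the seed, then draw uniform random numbers
set seed 20110701
generate double u1 = runiform()

* Resetting the same seed reproduces the same sequence
set seed 20110701
generate double u2 = runiform()

* The two draws agree observation by observation
assert u1 == u2
```

In real work, you would set the seed once near the top of the do-file rather than before each draw.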
Merge solution
One way to accomplish our goal would be to keep one observation from each
household, randomly sample from the remaining observations, and then merge
back to the original dataset. For example, assume a household identifier
hhid:
sort hhid
preserve
tempfile tmp
bysort hhid: keep if _n == 1
sample 10
sort hhid
save `tmp'
restore
merge m:1 hhid using `tmp'
keep if _merge == 3
drop _merge
We preserve the dataset before keeping just one observation from each
household, and then use sample to select an approximate 10% sample of
households. We save this sample to a temporary file. We then restore the
original dataset and merge it with the saved file. The part of the dataset
that we want is indicated by _merge == 3, which marks households present in
both the original data and the sample. With the merge m:1 syntax, Stata
sorts both datasets on hhid as needed, so the explicit sort commands are
harmless but not essential.
If you want a sample with an exact number of households rather than a
percentage of them, use sample with the count option; for example,
sample 10, count keeps exactly 10 households.
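The merge solution with an exact sample size might be sketched as follows. The toy dataset (100 households of 3 people each), the seed value, and the variable hhtag are assumptions made for illustration; only the sample line differs from the steps shown earlier. The sketch is meant to be run as a do-file in one go, and clear discards any data in memory.

```stata
* Toy data (hypothetical): 100 households with 3 members each
clear
set obs 300
generate long hhid = ceil(_n / 3)
set seed 20110701

* Reduce to one observation per household and sample households
preserve
tempfile tmp
bysort hhid: keep if _n == 1
* Exactly 10 households, not 10% of them
sample 10, count
save `tmp'
restore

* Keep every member of the sampled households
merge m:1 hhid using `tmp'
keep if _merge == 3
drop _merge

* Check: 10 distinct households, 30 people in all
egen byte hhtag = tag(hhid)
count if hhtag
assert r(N) == 10
assert _N == 30
```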
Look, no file choreography
We now show another solution, all in place with no pas de deux of
dancing files and without using sample. Knowing how to do
it using basic principles may appeal to you.
First, keep a copy of the sort order:
gen long order = _n
Then select one observation from each household:
egen select = tag(hhid)
Now produce some random numbers and sort:
gen rnd = runiform()
sort select rnd
One observation per household has now been sorted to the end, and those
observations have been shuffled on the fly, courtesy of the random numbers.
Suppose you want 10 of 100 households:
replace select = _n > (_N - 10)
The indicator select is now 1 for the last 10 observations
and 0 otherwise. Now we spread the word of being selected
among the household members:
bysort hhid (select): replace select = select[_N]
Finally, go back to the original sort order, and clean up:
sort order
drop order rnd
This variation keeps both the selected sample, for which select == 1, and
the other observations, for which select == 0. If you want only the sampled
observations, type drop if !select or, equivalently, keep if select.
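To see the in-place steps working end to end, here is a sketch on made-up data. The toy dataset (100 households of 3 people each), the seed value, the check variable hhtag, and the use of double for the random numbers are all choices of this sketch rather than of the original code. Note that clear discards any data in memory.

```stata
* Toy data (hypothetical): 100 households with 3 members each
clear
set obs 300
generate long hhid = ceil(_n / 3)
set seed 20110701

* The steps described above, in sequence
generate long order = _n
egen byte select = tag(hhid)
generate double rnd = runiform()
sort select rnd
replace select = _n > (_N - 10)
bysort hhid (select): replace select = select[_N]
sort order
drop order rnd

* Check 1: exactly 10 households are selected
egen byte hhtag = tag(hhid)
count if hhtag & select
assert r(N) == 10

* Check 2: select is constant within each household
bysort hhid (select): assert select[1] == select[_N]
```

The second check changes the sort order again, so in real use you would run it before sort order or simply omit it.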
To learn more
In addition to the usual online help or manual entries, see
http://www.stata.com/support/faqs/statistics/random-samples
for a discussion of sampling individuals.
Reference
- Weesie, J. 1997. dm46: Enhancement to the sample command.
  Stata Technical Bulletin 37: 6–7. Reprinted in
  Stata Technical Bulletin Reprints, vol. 7, pp. 37–38.