How can I sample clusters, not individuals?

Title   Sampling clusters, not individuals
Authors Nicholas J. Cox, Durham University, UK
        Scott Merryman, Risk Management Agency/USDA

Introduction

Often you need to sample clusters, not individuals. Suppose you have a dataset with individual people from several households, but you wish to sample households randomly, not individuals. Here are two ways to do so. Much of what we do here is also feasible through sample2 (Weesie 1997).

In each solution, clusters are chosen using pseudorandom numbers, drawn either by sample or directly by runiform(). If you want your results to be reproducible, set the random-number seed (see set seed) before drawing the sample.
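As a minimal sketch of why the seed matters, the following toy example draws random numbers twice from the same seed and checks that the two draws agree. The seed value 2718, the five-observation dataset, and the variable names u1 and u2 are arbitrary choices for illustration.

        clear
        set obs 5
        set seed 2718                   // arbitrary seed value
        gen double u1 = runiform()
        set seed 2718                   // resetting the seed restarts the sequence
        gen double u2 = runiform()
        assert u1 == u2                 // same seed, same random numbers

In practice, you would set the seed once, near the top of your do-file, and record the value you used.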

Merge solution

One way to accomplish our goal is to keep one observation from each household, randomly sample from the resulting one-per-household dataset, and then merge the sampled households back to the original dataset.

For example, assume a household identifier hhid:

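        sort hhid
        preserve                        // keep a copy of the full dataset
        tempfile tmp
        bysort hhid: keep if _n == 1    // one observation per household
        keep hhid                       // only the identifier is needed for the merge
        sample 10                       // keep about 10% of households
        sort hhid
        save `tmp'
        restore                         // back to the full dataset
        merge m:1 hhid using `tmp'
        keep if _merge == 3             // members of sampled households only
        drop _merge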

We preserve the dataset before keeping just one observation from each household and then use sample to draw an approximately 10% sample of households. We save this sample to a temporary file, restore the original dataset, and merge it with the saved file. The observations that we want are those with _merge == 3, that is, those whose hhid appears in both datasets. The merge m:1 syntax sorts both datasets on hhid as needed, so the explicit sort commands are a safeguard rather than a requirement.

If you want a sample with an exact number of households rather than a percentage, use sample with the count option; for example, sample 50, count keeps exactly 50 observations, which here means 50 households.
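The fixed-size variant can be sketched on a toy dataset as follows. Everything about the data is hypothetical (100 households of 4 members each, a target of 20 households, an arbitrary seed), and the keep(match) and nogenerate options of merge are used here simply as a shorthand for keeping _merge == 3 and dropping _merge.

        clear
        set seed 12345                  // arbitrary seed
        set obs 400
        gen long hhid = ceil(_n/4)      // 100 hypothetical households of 4 members
        preserve
        tempfile tmp
        bysort hhid: keep if _n == 1    // one observation per household
        keep hhid
        sample 20, count                // exactly 20 households
        save `tmp'
        restore
        merge m:1 hhid using `tmp', keep(match) nogenerate
        count                           // 80 observations: 20 households x 4 members

Because this uses preserve, restore, and a temporary file, it should be run as a single do-file rather than line by line.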

Look, no file choreography

We now show another solution, all in place with no pas de deux of dancing files and without using sample. Knowing how to do it using basic principles may appeal to you.

First, keep a copy of the sort order:

        gen long order = _n

Then select one observation from each household:

        egen select = tag(hhid)

Now produce some random numbers and sort:

        gen double rnd = runiform()     // double precision makes ties very unlikely
        sort select rnd

One observation per household has now been sorted to the end, and those observations have been shuffled on the fly, courtesy of the random numbers. Suppose you want 10 of 100 households:

        replace select = _n > (_N - 10)

The indicator select is now 1 for the last 10 observations and 0 otherwise. Now we spread the word of being selected among the household members:

        bysort hhid (select): replace select = select[_N]

Finally, go back to the original sort order, and clean up:

        sort order
        drop order rnd 

This variation keeps both the selected sample, for which select == 1, and the other observations, for which select == 0. If you want only the sampled observations, type drop if !select or, equivalently, keep if select.
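To see the steps work together, here is a sketch that runs them on a made-up dataset and checks the result. The data are hypothetical (100 households of 3 members each, built with expand), and the seed is arbitrary; only the sampling steps themselves come from the discussion above.

        clear
        set seed 12345                  // arbitrary seed
        set obs 100
        gen long hhid = _n              // 100 hypothetical households
        expand 3                        // three members per household
        gen long order = _n             // remember the current sort order
        egen select = tag(hhid)         // tag one observation per household
        gen double rnd = runiform()
        sort select rnd                 // tagged observations last, in random order
        replace select = _n > (_N - 10) // pick the last 10, one per household
        bysort hhid (select): replace select = select[_N]
        sort order
        drop order rnd
        count if select                 // 30 observations: 10 households x 3 members

Every household contributes either all of its members or none, which is the point of sampling clusters rather than individuals.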

To learn more

In addition to the usual online help or manual entries, see FAQ: "How can I take random samples from an existing dataset?" for a discussion of sampling individuals.

Reference

Weesie, J. 1997. dm46: Enhancement to the sample command. Stata Technical Bulletin 37: 6–7. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 37–38.