Stata | FAQ: Sampling clusters, not individuals

How can I sample clusters, not individuals?

Title		Sampling clusters, not individuals
Authors		Nicholas J. Cox, Durham University, UK; Scott Merryman, Risk Management Agency/USDA

Introduction

Often you need to sample clusters, not individuals. Suppose you have a dataset with individual people from several households, but you wish to sample households randomly, not individuals. Here are two ways to do so. Much of what we do here is also feasible through sample2 (Weesie 1997).

In each solution, clusters are chosen using random numbers: drawn internally by sample in the first, and produced explicitly by runiform() in the second. If you want to be able to reproduce your sample, set the random-number seed (see set seed) before the random numbers are drawn.
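As a minimal sketch of what setting the seed buys you, the lines below draw two sets of random numbers from the same seed and check that they agree. The seed value 2803 and the variable names u1 and u2 are arbitrary choices made for this illustration.

```stata
* Illustration only: the seed value and variable names are arbitrary.
clear
set obs 5

set seed 2803
generate double u1 = runiform()

* Resetting the same seed restarts the same stream of random numbers.
set seed 2803
generate double u2 = runiform()

* Both draws are identical, so any sample based on them would be too.
assert u1 == u2
```

Without the second set seed, u2 would continue the stream and differ from u1.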

Merge solution

One way to accomplish our goal would be to keep one observation from each household, randomly sample from the remaining observations, and then merge back to the original dataset.

For example, assume a household identifier hhid:

        sort hhid
        preserve
        tempfile tmp
        bysort hhid: keep if _n == 1
        sample 10
        sort hhid
        save `tmp'
        restore
        merge m:1 hhid using `tmp'
        keep if _merge == 3
        drop _merge

We preserve the dataset before keeping just one observation from each household, and then use sample to select approximately 10% of the households. We save this sample to a temporary file. We then restore the original dataset and merge it with the saved file. The observations we want are those with _merge == 3, that is, those whose hhid appears in both datasets. With the merge m:1 syntax, merge sorts the datasets itself when necessary, so the explicit sort commands are harmless but not required.

If you want a sample with an exact number of households rather than a percentage of them, use sample with the count option; for example, sample 10, count keeps exactly 10 households.
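The fixed-count variant can be sketched end to end on made-up data. Everything about the data here is an assumption for illustration: 100 households of 3 members each, identified by hhid. The keep(match) and nogenerate options of merge are used as a shorthand for keeping _merge == 3 and dropping _merge.

```stata
* Hypothetical data: 100 households (hhid) with 3 members each.
clear
set seed 2803
set obs 100
generate long hhid = _n
expand 3

preserve
tempfile tmp
bysort hhid: keep if _n == 1      // one row per household
sample 10, count                  // keep exactly 10 households
keep hhid                         // only the key is needed for the merge
save `tmp'
restore

* Keep only members of the sampled households; no _merge variable is left.
merge m:1 hhid using `tmp', keep(match) nogenerate
```

With these assumed data, the result holds 30 observations: all 3 members of each of the 10 sampled households.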

Look, no file choreography

We now show another solution, all in place with no pas de deux of dancing files and without using sample. Knowing how to do it using basic principles may appeal to you.

First, keep a copy of the sort order:

        gen long order = _n

Then select one observation from each household:

        egen select = tag(hhid)

Now produce some random numbers and sort:

        gen rnd = runiform()
        sort select rnd

One observation per household has now been sorted to the end, and those observations have been shuffled on the fly, courtesy of the random numbers. Suppose you want 10 of 100 households:

        replace select = _n > (_N - 10)

The indicator select is now 1 for the last 10 observations and 0 otherwise. Now we spread the word of being selected among the household members:

        bysort hhid (select): replace select = select[_N]

Finally, go back to the original sort order, and clean up:

        sort order
        drop order rnd

This variation keeps both the selected sample, for which select == 1, and the other observations, for which select == 0. If you want only the sampled observations, type drop if !select or, equivalently, keep if select.
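The steps above can be collected into one runnable sketch. The data are made up for illustration (100 households of 1 to 5 members, built with runiformint(), which assumes Stata 14 or later), and the local macro k and the variables size and first are hypothetical names. Instead of hard-coding 10, the sketch derives the number of households from a 10% sampling fraction, which may be convenient when you do not know how many households there are.

```stata
* Hypothetical data: 100 households (hhid) of 1 to 5 members each.
clear
set seed 2803
set obs 100
generate long hhid = _n
generate byte size = runiformint(1, 5)
expand size

generate long order = _n                 // remember the current order
egen select = tag(hhid)                  // 1 on one row per household
count if select
local k = round(0.10 * r(N))             // 10% of the households
generate double rnd = runiform()
sort select rnd                          // tagged rows last, shuffled
replace select = _n > (_N - `k')         // flag the last k tagged rows
bysort hhid (select): replace select = select[_N]   // spread to members

* Check: exactly k distinct households are flagged.
bysort hhid: generate byte first = _n == 1
count if first & select
assert r(N) == `k'

sort order
drop order rnd first
```

Storing rnd as a double rather than the default float makes ties among the random numbers less likely, so the shuffle depends less on how sort breaks ties.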

To learn more

In addition to the usual online help or manual entries, see FAQ: "How can I take random samples from an existing dataset?" for a discussion of sampling individuals.

Reference

Weesie, J. 1997. dm46: Enhancement to the sample command. Stata Technical Bulletin 37: 6–7. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 37–38.
