Re: st: semi-random sampling (how to impose properties of one population onto a subsample of a different population)

From   Steven Samuels <[email protected]>
To   [email protected]
Subject   Re: st: semi-random sampling (how to impose properties of one population onto a subsample of a different population)
Date   Sun, 7 Aug 2011 10:32:36 -0400

Sorry, I misunderstood. Here's code that you can adapt. Note that you set the sample size you want in the first line.

*************CODE BEGINS*************
scalar sampsize = 500
set seed 842655
/* Input Frequencies for External Population
You can get these from -contract-
in the original external data set:
"contract agegp region, freq(freq1)" */
clear
input agegp region freq1
1 1 501
1 2 415
2 1 1809
2 2 3003
3 1 1288
3 2 1400
end
egen tot1 = total(freq1)
gen ssize = round(sampsize*freq1/tot1)
/* Check Frequencies */
tab agegp region [fw=freq1], cell
tab agegp region [fw=ssize], cell

sort agegp region
tempfile t1
save `t1'
/*  Create Data set to be sampled from the auto data */

sysuse auto, clear
expand 100
rename rep78 agegp
rename foreign region

recode agegp 2=1 5=1 .=1 3=2 4=3  // values 1,2,3
replace region = region +1        // values 1,2

/* Merge with external counts */
sort agegp region
merge m:1 agegp region using `t1'
tab _merge
drop _merge

egen stratum = group(agegp region)
levelsof stratum, local(levels)
tempfile t2
save `t2'
foreach x of local levels {
    use `t2', clear
    keep if stratum==`x'
    gen u = uniform()
    sort u
    keep if _n<=ssize
    tempfile td`x'
    save `td`x''
}

tempfile t0 //empty data set to append to
clear
gen dummy=1
save `t0'
foreach x of local levels {
    append using `td`x''
}
drop dummy
/* Check frequencies again */
tab agegp region , cell missing
save sample1, replace
**************CODE ENDS**************
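
The per-stratum loop above can also be collapsed into a single by-group step. The following is a minimal sketch, not part of the original post. It builds a made-up data set with four strata: the variable names stratum, ssize, and u mirror the code above, but the stratum sizes and target sizes are hypothetical. It also uses runiform(), the newer name for uniform().

```stata
* Sketch only: hypothetical data, four strata of 250 observations each
clear
set seed 842655
set obs 1000
gen stratum = 1 + mod(_n, 4)      // strata 1,2,3,4
gen ssize   = 10 * stratum        // hypothetical targets: 10, 20, 30, 40

* Random order within stratum, then keep the first ssize rows of each
gen double u = runiform()
bysort stratum (u): keep if _n <= ssize

* Check the realized cell sizes
tab stratum
```

Sorting on u within each stratum puts its rows in random order, so keeping the first ssize rows gives a simple random sample without replacement from that stratum. If a stratum holds fewer than ssize rows, all of its rows are kept, so it is worth comparing the available and requested counts first.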

On Aug 7, 2011, at 5:05 AM, Ekaterina Hertog wrote:

Dear Steven,
thank you for your help; however, it does not fully solve my problem. Your proposed solution will allow me to roughly preserve the population percentages from the whole sample in a subsample. What I need, however, is to impose population percentages found in a different dataset on a subsample I am creating. Essentially I have two datasets: one of high income women and one of middle income women. High income women tend to be older and are more likely to live in the capital. I need to create a subsample of the dataset of middle income women which would match the high income women dataset on age and location characteristics.
Does anyone know how to do this in Stata 11?
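
The target percentages described here can be tabulated from the reference data set with -contract-, which reduces it to one row per age-by-location cell. The following is a minimal sketch under stated assumptions: the auto data stand in for the high income data set, and agegp, region, and freq1 are illustrative variable names, not names from the poster's data.

```stata
* Sketch only: the auto data stand in for the external (high income) data set
sysuse auto, clear
rename rep78 agegp
rename foreign region
recode agegp 2=1 5=1 .=1 3=2 4=3      // collapse to three groups: 1,2,3
replace region = region + 1           // values 1,2

* One row per agegp-by-region cell, with the cell count in freq1
contract agegp region, freq(freq1)
list, noobs
```

The resulting cell counts are what set the within-cell sample sizes when drawing from the second data set.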

On 07/08/2011 09:08, Steven Samuels wrote:
> The following code shows how to take a 10% sample within categories formed by two variables. The sample and whole population percentages will be approximately the same, with the agreement better for larger within-cell sample sizes.
> Steve
> *************CODE BEGINS*************
> sysuse auto, clear
> expand 6
> set seed 842655
> recode rep78 1/2=5 .=5
> tab rep78 foreign, cell
> sample 10, by(foreign rep78)
> tab rep78 foreign, cell
> **************CODE ENDS**************
> On Aug 6, 2011, at 4:23 PM, Ekaterina Hertog wrote:
> Dear all,
> I need to take a subsample of observations from a big dataset, making sure that the people in the subsample have a given geographic and age profile. I need to make sure that, say, 50% of people in the subsample come from the capital and 50% from other towns. Within each of these 2 locations I want to preserve a certain age structure: say, in a city, 3 people aged 23, 4 people aged 24 …
> Within those geographic and age profiles I want to select the observations randomly. Is it possible to do that in Stata 11? Any thoughts on how I would go about it?
