From: Steven Samuels <sjsamuels@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: semi-random sampling (how to impose properties of one population onto a subsample of a different population)
Date: Mon, 15 Aug 2011 17:13:44 -0400

You are very welcome, Ekaterina, but note Austin's follow-up. If you wish to make these two groups comparable for analysis, there are better, more comprehensive, and more defensible approaches. Austin mentioned propensity score weighting, but -margins- might be superior if you are not interested in causal effects of income.

Steve

On Aug 15, 2011, at 4:12 PM, Ekaterina Hertog wrote:

Thank you very much! Sorry for posing the original question imprecisely!
ekaterina

On 07/08/2011 18:32, Steven Samuels wrote:
>
> Sorry, I misunderstood. Here's code that you can adapt. Note that you set the sample size you want in the first line.
>
> *************CODE BEGINS*************
> scalar sampsize = 500
> set seed 842655
>
> clear
> /* Input Frequencies for External Population
> You can get these from -contract-
> in the original external data set:
> "contract agegp region, freq(freq1)"
> */
> input agegp region freq1
> 1 1 501
> 1 2 415
> 2 1 1809
> 2 2 3003
> 3 1 1288
> 3 2 1400
> end
> egen tot1 = total(freq1)
> gen ssize = round(sampsize*freq1/tot1)
> /* Check Frequencies */
> tab agegp region [fw=freq1], cell
> tab agegp region [fw=ssize], cell
>
> sort agegp region
> tempfile t1
> save `t1'
> /* Create Data set to be sampled from the auto data */
>
> sysuse auto, clear
> expand 100
> rename rep78 agegp
> rename foreign region
>
> recode agegp 2=1 5=1 .=1 3=2 4=3 // values 1,2,3
> replace region = region +1 // values 1,2
>
>
> /* Merge with external counts */
> sort agegp region
> merge m:1 agegp region using `t1'
> tab _merge
> drop _merge
>
> egen stratum = group(agegp region)
> levelsof stratum, local(levels)
> tempfile t2
> save `t2'
> foreach x of local levels{
>     use `t2'
>     keep if stratum==`x'
>     gen u = uniform()
>     sort u
>     keep if _n<=ssize
>     tempfile td`x'
>     save `td`x''
> }
>
> clear
> tempfile t0 //empty data set to append to
> gen dummy=1
> save `t0'
> foreach x of local levels{
>     append using `td`x''
> }
> drop dummy
> /* Check frequencies again */
> tab agegp region , cell missing
> save sample1, replace
> **************CODE ENDS**************
>
> On Aug 7, 2011, at 5:05 AM, Ekaterina Hertog wrote:
>
> Dear Steven,
> Thank you for your help; however, it does not fully solve my problem. Your proposed solution will allow me to roughly preserve the population percentages of the whole sample in a subsample. What I need, however, is to impose the population percentages found in a different dataset on a subsample I am creating. Essentially I have two datasets: one of high income women and one of middle income women. High income women tend to be older and are more likely to live in the capital. I need to create a subsample of the dataset of middle income women which would match the high income women dataset on age and location characteristics.
> Does anyone know how to do this in Stata 11?
> Ekaterina
>
> On 07/08/2011 09:08, Steven Samuels wrote:
>> The following code shows how to take a 10% sample within categories formed by two variables. The sample and whole population percentages will be approximately the same, with the agreement better for larger within-cell sample sizes.
>>
>> Steve
>>
>> *************CODE BEGINS*************
>> sysuse auto, clear
>> expand 6
>> set seed 842655
>> recode rep78 1/2=5 .=5
>> tab rep78 foreign, cell
>> sample 10, by(foreign rep78)
>> tab rep78 foreign, cell
>> **************CODE ENDS**************
>>
>> On Aug 6, 2011, at 4:23 PM, Ekaterina Hertog wrote:
>>
>> Dear all,
>> I need to take a subsample of observations from a big dataset, making sure that the people in the subsample have a given geographic and age profile. I need to make sure that, say, 50% of people in the subsample come from the capital and 50% from other towns. Within each of these 2 locations I want to preserve a certain age structure: say in a city: 3 people aged 23, 4 people aged 24 …
>> Within those geographic and age profiles I want to select the observations randomly. Is it possible to do that in Stata 11? Any thoughts on how I would go about it?

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
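
The per-stratum loop in the quoted code can be written more compactly. What follows is a minimal sketch, not part of the original thread: it reuses the same illustrative cell counts and the auto data, and it assumes every cell of the data being sampled has at least `ssize` observations. The names `targets` and `u` are hypothetical, and `runiform()` stands in for the older `uniform()`.

```stata
* Sketch only: draw the target number of observations per cell
* with a single by-group sort instead of a loop over tempfiles.
scalar sampsize = 500
set seed 842655

clear
* Cell counts of the external population (same illustrative values as above)
input agegp region freq1
1 1 501
1 2 415
2 1 1809
2 2 3003
3 1 1288
3 2 1400
end
egen tot1 = total(freq1)
gen ssize = round(sampsize*freq1/tot1)   // target sample size per cell
tempfile targets                          // hypothetical name
save `targets'

* Stand-in for the data set to be sampled
sysuse auto, clear
expand 100
rename rep78 agegp
rename foreign region
recode agegp 2=1 5=1 .=1 3=2 4=3          // values 1,2,3
replace region = region + 1               // values 1,2

* Attach the targets, keeping only cells present in both files
merge m:1 agegp region using `targets', keep(match) nogenerate

* Random order within each cell, then keep the first ssize observations
gen double u = runiform()
bysort agegp region (u): keep if _n <= ssize

* Check the achieved cell percentages
tab agegp region, cell missing
```

If a cell of the sampled data holds fewer observations than its `ssize`, that cell is kept in full and falls short of its target, so the final tabulation is worth comparing against the external percentages.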
